PortalTransforms: PDF with Images
The default pdf_to_html transform from PortalTransforms only reads the HTML code generated by the pdftohtml program. Loading the images as well is not much work, because the office converter already does it.
class pdf_to_html(commandtransform):
    ...
    def convert(self, data, cache, **kwargs):
        name = self.name()
        tmpdir, fullname = self.initialize_tmpdir(data, filename=name)
        target_name = '%s/%s.html' % (tmpdir, name)
        command = 'cd "%s" && pdftohtml -noframes -enc UTF-8 %s %s' % (
            tmpdir, fullname, target_name)
        log('PortalTransforms: Calling %s' % command)
        os.system(command)
        html = self.html(target_name)
        path, images = self.subObjects(tmpdir)
        objects = {}
        if images:
            self.fixImages(path, images, objects)
        self.cleanDir(tmpdir)
        cache.setData(html)
        cache.setSubObjects(objects)
        return cache

    def html(self, html_file_name):
        htmlfile = file(html_file_name, 'r')
        html = htmlfile.read()
        htmlfile.close()
        html = scrubHTML(html)
        body = bodyfinder(html)
        return body
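The image handling in convert comes down to scanning the temporary directory for image files and putting them into a dictionary. The following is a minimal, standalone sketch of that idea, not the PortalTransforms code itself: collect_images and IMAGE_EXTENSIONS are hypothetical names, and the sketch assumes the converter writes its images flat into the temporary directory next to the HTML file.

```python
import os

# Hypothetical helper, not part of PortalTransforms. The extension list is
# an assumption about which image types the converter may produce.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif')


def collect_images(tmpdir):
    """Return a dict mapping image file names in tmpdir to their raw bytes."""
    objects = {}
    for filename in sorted(os.listdir(tmpdir)):
        # Skip the HTML output and anything else that is not an image.
        if not filename.lower().endswith(IMAGE_EXTENSIONS):
            continue
        path = os.path.join(tmpdir, filename)
        with open(path, 'rb') as image_file:
            objects[filename] = image_file.read()
    return objects
```

A dictionary of this shape, keyed by file name, is the kind of value the transform above hands to cache.setSubObjects.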
The whole file is available on gocept's subversion server as part of glome.
To enable the transform, point your browser to the portal_transforms tool of your Plone site. Delete the existing pdf_to_html transform and add a new one, using pdf_to_html as its Id. The module name depends on where you installed the transform; in the case of glome it would be Products.glome.transforms.pdf_to_html.
Note that the PortalTransforms version that ships with Plone 2.0.5 has a bug: deleting a transform does not unregister it.