cyberbryce Posted February 1, 2010

Evernote offers a premium feature whereby a PDF is OCR'd online and is then searchable within the client. Because Evernote does this far more quickly and easily than any desktop OCR software, I imported many old, scanned research documents into Evernote. It was nice to know that these documents would be made searchable.

From time to time, I also made use of a different aspect of that same feature, whereby you could export a searchable image PDF from the client. That way, for example, I could export searchable PDFs to Mendeley to gather bibliographic information, or send them to colleagues.

At some point, the feature was changed so that the searchable PDF exportable from the premium client is text-only. That means it lacks images and equations, making it nearly useless for me when loaded in programs other than Evernote. I was told by Evernote support that this change is due to file sizes, and it is true that the text-only PDF is much smaller than the original.

It'd be nice if the format of the searchable PDF were a configurable option. Transferring large searchable PDFs could be counted against our monthly usage, if we choose.

In the meantime, I wrote a workaround using Python to extract searchable image PDFs from exported Evernote archives. It's crude: it searches an exported .enex archive for notes with PDF resource data and alternate resource data (where the searchable PDF is stored). For those resources, the program composites the text PDF and the original PDF, in that order, and saves the resulting PDF to a new file with the original filename. This produces a searchable image PDF that appears to work in Preview and Adobe Acrobat, AFAIK.

I've attached the Python script. The syntax is "./enexSearchablePDFs.py input.enex [includeMask]". You'll need to install pyPdf from http://pybrary.net/pyPdf/ and ElementTree (for XML parsing) from http://effbot.org/zone/element-index.htm .
Specify 'includeMask' if some PDFs have transparency and you need to actually hide the searchable text (a crude hack that works for some journal articles where the first page is transparent). The program checks that the output PDF file doesn't already exist before writing. Though I've done my best to make it work passably well and without risk to your system, I offer this AS IS to the public domain, without any other license or explicit or implied warranty, etc.

Attachment: enexSearchablePDFs.py.zip
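
The approach described above can be sketched roughly as follows. This is not the attached script, only a minimal illustration under some assumptions: that the .enex file stores each attachment in a `resource` element with `mime`, base64 `data`, base64 `alternate-data`, and a `resource-attributes/file-name` child; and that the newer `pypdf` package is used in place of pyPdf. The function names `iter_pdf_pairs` and `composite` are made up for this example, and the 'includeMask' option is not covered.

```python
import base64
import io
import xml.etree.ElementTree as ET


def iter_pdf_pairs(source):
    """Yield (file_name, original_pdf, searchable_pdf) as bytes for each
    PDF resource that also carries alternate data.

    `source` is a path or binary file object for an exported .enex file.
    Element names are assumptions about the ENEX layout.
    """
    root = ET.parse(source).getroot()
    for resource in root.iter("resource"):
        if (resource.findtext("mime") or "").strip() != "application/pdf":
            continue
        data = resource.findtext("data")
        alternate = resource.findtext("alternate-data")
        if not data or not alternate:
            continue  # no OCR'd copy stored for this resource
        name = resource.findtext("resource-attributes/file-name") or "untitled.pdf"
        # b64decode ignores the line breaks that wrap the encoded data
        yield name.strip(), base64.b64decode(data), base64.b64decode(alternate)


def composite(original_pdf, searchable_pdf, out_path):
    """Write a PDF with each original page drawn over the text-only page.

    Assumes both PDFs have the same number of pages in the same order.
    """
    from pypdf import PdfReader, PdfWriter  # third-party; assumed installed

    originals = PdfReader(io.BytesIO(original_pdf)).pages
    texts = PdfReader(io.BytesIO(searchable_pdf)).pages
    writer = PdfWriter()
    for text_page, original_page in zip(texts, originals):
        text_page.merge_page(original_page)  # text first, image on top
        writer.add_page(text_page)
    with open(out_path, "xb") as fh:  # "x" refuses to overwrite an existing file
        writer.write(fh)
```

A typical use would loop over `iter_pdf_pairs("input.enex")` and call `composite(original, searchable, name)` for each result. Opening the output in "xb" mode mirrors the check that the output file does not already exist: Python raises `FileExistsError` instead of overwriting.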