
(Archived) .enex -> Searchable Image PDF Functionality



Evernote offers a premium feature whereby a pdf is OCR'd online and then is searchable within the client. Because Evernote does this far more quickly and easily than any desktop OCR software, I imported many old, scanned research documents into Evernote. It was nice to know that these documents would be made searchable.

From time to time, I also made use of a different aspect of that same feature, whereby you could export a searchable image PDF from the client. That way, for example, I could export searchable PDFs to Mendeley to gather bibliographic information, or send them to colleagues.

At some point, the feature was changed so that the searchable PDF exported from the premium client is text-only. That means it lacks images and equations, making it nearly useless for me when loaded in programs other than Evernote. I was told by Evernote support that this change was made because of file sizes -- and it is true that the text-only PDF is much smaller than the original.

It'd be nice if the format of the searchable PDF were a configurable option. Transferring large searchable PDFs could be counted against our monthly usage, if we choose.

In the meantime, I wrote a workaround using Python to extract searchable image PDFs from exported Evernote archives. It's crude -- it searches an exported .enex archive for notes with PDF resource data and alternate resource data (where the searchable PDF is stored). For those resources, the program composites the text PDF and the original PDF, in that order, and saves the resulting PDF to a new file with the original filename. This produces a searchable image PDF that appears to work in Preview and Adobe Acrobat, AFAIK.
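For anyone who would rather not download the attachment, here is a minimal sketch of just the search step described above, not the attached script itself. It assumes each PDF resource in the .enex file looks like a `<resource>` element with `<mime>`, a base64 `<data>` element, a base64 `<alternate-data>` element and a `<resource-attributes>/<file-name>`; the function name `find_pdf_pairs` is made up for this example. It uses only the standard library and leaves out the compositing step.

```python
import base64
import xml.etree.ElementTree as ET


def find_pdf_pairs(enex_source):
    """Yield (file_name, original_pdf, searchable_pdf) for each PDF resource
    that carries both regular data and alternate data.

    enex_source may be a path to an .enex file or an open file object.
    The element names used here are assumptions about the export format.
    """
    tree = ET.parse(enex_source)
    for resource in tree.iter("resource"):
        mime = (resource.findtext("mime") or "").strip()
        if mime != "application/pdf":
            continue  # not a PDF attachment
        data = resource.findtext("data")
        alternate = resource.findtext("alternate-data")
        if not data or not alternate:
            continue  # no searchable version stored for this resource
        name = resource.findtext("resource-attributes/file-name")
        name = (name or "untitled.pdf").strip()
        # b64decode ignores the line breaks inside the encoded payload
        yield name, base64.b64decode(data), base64.b64decode(alternate)
```

Each tuple gives the two PDFs that would then be composited page by page (text PDF first, original on top) with a PDF library such as pyPdf.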

I've attached the Python script. The syntax is "./enexSearchablePDFs.py input.enex [includeMask]".

You'll need to install pyPdf from http://pybrary.net/pyPdf/ and ElementTree (for XML parsing) from http://effbot.org/zone/element-index.htm . Specify 'includeMask' if some PDFs have transparency, in which case the searchable text needs to be actually hidden (a crude hack that works for some journal articles where the first page is transparent).

(The program checks to make sure the output pdf file doesn't exist already before writing. Though I've done my best to make it work passably well and without risk to your system, I offer this AS IS to the public domain, without any other license or explicit or implied warranty, etc.)

enexSearchablePDFs.py.zip

  • 3 years later...

I'm curious about this. Given that we can't choose which notebooks are synchronized with a particular client, I'm thinking about how to archive content so I can keep certain machines lean. The idea of keeping the older content on a machine of my own is intriguing, but it has to be searchable. So far I've tried exporting an EN archive and found that the Mac's Spotlight doesn't find anything inside it. Exporting as HTML doesn't appear to give me OCR'd searchable words either. It sounds like your solution might give me this, but I have to wonder if something is still missing. What I've read about the EN OCR and index system makes it sound like it does a lot of fancy stuff. Are all the potential keywords included in the .enex file? Does a Spotlight search yield the same thing as a search in the Evernote client?

BTW, EN team: while experimenting, I accidentally double-clicked an .enex file and it immediately imported into Evernote. There was no prompt asking whether I was sure I wanted to do that, and no way to stop it once it was going. That seems a bit dangerous, depending on how big the file is.


Archived

This topic is now archived and is closed to further replies.
