(Archived) Searchable pdfs: Evernote's servers or my scanner's OCR?


huladaddy

Idea

Question 1: Am I better off using the OCR capabilities of my scanner (ScanSnap S1300) to make my pdfs searchable, or letting Evernote's servers perform their magic? Which method will provide the best results?

Question 2: If I already have searchable pdfs that were created by my scanner, is there a way to delete the OCR data and force them to be reread by Evernote's recognition servers?

Link to comment

5 replies to this idea

I have been wondering about the same. I have not reached a decision yet, either.

It's hard to tell which will be most accurate. But with big volumes / fast action needed, it is nice to be able to just shoot the docs off to Evernote and not wait for the computer to complete OCR.

One benefit of local OCR though, especially for big documents, is that when you open them from Evernote later on, you are able to search within the file. Also, you are able to select text and cut/paste it somewhere else. (Unfortunately EN does not insert its index data into the documents).

For big stuff like manuals, ebooks, I have performed local OCR afterwards, by opening the attachment in Acrobat 9.

If you have Acrobat 9 (I got it with my ScanSnap S1500), you can delete the OCR data by going to Document > Examine Document.

You will see an item "Hidden text". This can be removed.

However, the more interesting question (for me) is: what does EN actually do when this info is already there? Does it use it at all (it seems so, from what I can tell)? And if so, is it complementary to its own indexing, or does it overrule it?

Also, if using local OCR, I wonder if Evernote search will still find text in these documents in case one reverts to a free account at some point, or moves the documents to an alternative, free account. I signed up to premium quite fast and didn't get to experiment before this. Anybody?

Link to comment

To the original poster: It depends on how good/fast your own OCR software is and on your workflow. On their servers, Evernote uses top-notch OCR on PDFs, so I would expect about the same quality of locally OCRed vs. Evernote-OCRed results. A downside of Evernote's way of doing it is that, once you move the PDF-attachment out of Evernote, you lose the searchability. That's why I stick with local OCR for now, even if it makes the workflow a bit more clunky (having to wait for OCR to finish before being able to scan the next document).

To tjeef: When Evernote finds that a scanned PDF already has OCRed text in it, it leaves the file alone and doesn't run its own OCR on it. It still supports search and text highlighting fine within these PDF attachments, even if the PDF has been OCRed by another application (tested on Evernote/Mac). In any case, keep in mind that Evernote doesn't apply handwriting recognition to PDFs.

If you revert to free usage of Evernote, all your data (including OCR information by yourself or Evernote) stays where it is, and you can still access and search everything.

Migration to a different Evernote account is another matter. I don't know if there's a way to migrate notes to a different account while keeping all metadata.

Please note that you don't have to abandon your existing account in order to move from paid to free. Just stop your subscription, and keep using the free model, and when you change your mind, just start the subscription again.

Link to comment
I agree that your choice depends on the speed of your OCR and the effect that this has on your workflow; plus the convenience of being able to search within PDFs immediately rather than waiting for Evernote - which is quick, but not immediate. I use an S1500 and local OCR, which slows down throughput a little, but I use the OCR interval on the last document to get the paperwork ready for the next scan, so there's no real perceptible delay. Plus some of my PDFs are passworded, or attached as icons rather than open files, so the main content won't be OCR'd or indexed on the server. I use headings, tags and summaries to find these PDFs, then search within them myself. I deal in financial, legal, insurance and IT stuff all of which have their own special jargon, so if there's a need to recognise technical terms I'd rather the dictionary was kept local and within my control. All very subjective.

Link to comment

One benefit of local OCR though, especially for big documents, is that when you open them from Evernote later on, you are able to search within the file. Also, you are able to select text and cut/paste it somewhere else. (Unfortunately EN does not insert its index data into the documents).

Is this true?

I understand that if you right-click on a pdf and choose "save as...", it will save the hidden text with the pdf.

I noticed that if I open a pdf in Preview that has been OCR'ed by Evernote I am able to select and search text just fine.

Link to comment

I just tested this and I'm afraid my results seem to contradict yours. I tested with two different PDFs which I had created with JotNot for the iPhone (which creates PDFs from photographs, but does not perform OCR) and imported into Evernote. These have been OCRed by Evernote, which I can verify by searching for text within the note, and the results get highlighted fine in the inlined PDF.

When I open the PDF in Preview, however, there is no text whatsoever. A quick select-all, copy, paste in TextEdit just yields a few line breaks, and the same text searches within Preview don't yield any results at all.

Evernote allows you to save a searchable copy of your PDF, as well. But this copy only contains the recognized text chunks, and does not contain the original images/scans from the pages. So Evernote's OCR gives you either searchable PDFs, or PDFs which are true representations of your scans, but not both in the same PDF file.

Are you absolutely positive that the PDF in question has not already been OCRed by some other software?
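If you would rather not rely on select-all in Preview, a rough way to check for an embedded text layer is to look inside the PDF's content streams for text-showing operators (BT ... Tj/TJ ... ET). The sketch below is only a heuristic and uses nothing but the Python standard library. The function names are made up for illustration, it only handles uncompressed and Flate-compressed streams, and it will miss text stored in other ways, so treat a negative result as "probably no text layer" rather than proof.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A text object: BT, then a show-text operator (Tj or TJ), then ET.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def _mostly_printable(data: bytes, sample: int = 4096) -> bool:
    """Guess whether data is page content (ASCII operators) rather than image bytes."""
    head = data[:sample]
    if not head:
        return False
    printable = sum(1 for b in head if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(head) > 0.9


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to contain text-showing operators."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the Flate data
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed (or some other filter)
        if _mostly_printable(data) and TEXT_OP_RE.search(data):
            return True
    return False
```

To try it on a file exported from Evernote, read the file in binary mode and pass the bytes in, for example `has_text_layer(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name.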

Link to comment

Archived

This topic is now archived and is closed to further replies.
