
(Archived) How does Evernote handle OCR?


grekko

Idea

Hey guys,

I am using Evernote as a document trunk for my private analog mail. I started by running OCRx and manually indexing the scanned PDF documents. Now that I am a premium member, I would like to know about EN's capabilities for PDF OCR/indexing.

Does Evernote execute the OCR process on the server side?

(Here) spg SCOTT points out that the OCR is done on the server and that it takes a magical span of time until it's synced back to the client?

Is the result of OCR/Indexing put back inside the source PDF?

(Here) jbenson2 says "The Evernote OCR process only works in Evernote. Once the PDF is pulled and sent elsewhere it is no longer searchable." Is that true?

Sincerely,

Gregory

Link to comment

24 replies to this idea


  • Level 5*
Is there a way to copy some or all of that searchable text and paste it into another document as editable text?

Not easily, no. Evernote does not create a stream of contiguous words to match whatever text it finds. Rather, it creates a set of guesses at what words exist in the image, each with its location and extents. These word guesses are stored individually along with the note. You can see what they are by exporting the note to Evernote format and pulling the resultant .enex file into a text editor. You'd see them as character data in <t> items under the <recoIndex> section, down towards the end of the note.
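For anyone who wants to look at those word guesses outside a text editor, here is a minimal Python sketch. It assumes the recognition data has already been copied out of the .enex file as an XML string, that each guessed region is an <item> element holding one or more <t> candidates, and that the w attribute on a <t> is a confidence weight. The function name and the sample data are made up for illustration and are not taken from Evernote's documentation.

```python
import xml.etree.ElementTree as ET


def best_guesses(reco_xml: str) -> list[str]:
    """Return the highest-weighted word guess for each recognized region.

    Assumes <recoIndex> contains <item> elements, each with one or more
    <t w="..."> candidates, where w is treated as a confidence weight.
    """
    root = ET.fromstring(reco_xml)
    words = []
    for item in root.iter("item"):
        candidates = item.findall("t")
        if not candidates:
            continue  # region with no word guesses at all
        best = max(candidates, key=lambda t: int(t.get("w", "0")))
        words.append(best.text or "")
    return words


# Hypothetical sample data, shaped like the structure described above.
sample = """<recoIndex objType="image">
  <item x="10" y="12" w="80" h="20"><t w="91">Evernote</t><t w="40">Evemote</t></item>
  <item x="95" y="12" w="40" h="20"><t w="77">OCR</t></item>
</recoIndex>"""

print(best_guesses(sample))
```

Because each region is guessed independently, the result is a bag of words in document order at best, not clean running text, which is why copying the OCR output is not straightforward.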

Link to comment
  • Level 5*

I am referring back to an earlier cross-referenced post by Heather on OCR limitations as well as the knowledgebase.  Evernote apparently now will OCR a file up to 50 MB (instead of 25) and 100 pages.  My question is what happens when a large file has already been OCR'd before it is added to Evernote.  Specifically, will Evernote index the text and make it searchable?

Hi. Welcome to the forums. My understanding is that Evernote will not re-OCR something if it has already been OCR'd by some other program. However, it will certainly index the content for searches.

One caveat: the last time I checked, even if you download all of your notes for offline use on iOS, if you do not have an Internet connection, then the PDF contents will not show up in search results. This has never been documented, so perhaps it is something they hope to get working sometime in the future? Anyhow, the workaround that I use is to textify all of my PDFs (use Automator on the Mac to extract text from multiple PDFs) and put that into Evernote. When I come across something in one of my textified notes, if I need to refer to the original PDF (usually living in Dropbox) it has the same file name, so it is easily found.

Link to comment


I'm not so familiar with the tool you are using now.

I found an OCR component that could be re-edited online according to your own requirements, but I'm new to this.

Who can tell me the difference between them?

Many thanks

Sincerely,

Arron

 

Not sure what you mean by "re-edited online", but your link points to a commercial toolkit for use in writing .NET applications (typically, for Windows) - I don't think that will help you get OCR text out of Evernote. 

 

The tool I linked to - which works on a Mac (OS X) only - is a crude solution to *extract* the text that Evernote has recognized in images via OCR (Evernote is currently using this text solely to make such text searchable, without providing direct access to it).

Link to comment

As for the ability to copy OCRed text from the images embedded in a note: There's a somewhat crude, but still potentially useful OS X-only solution here (installs as an OS X service that you can invoke with the note of interest selected - OCRed text is then copied to clipboard.)

Link to comment

Thanks for the encouragement... It's been over a week since I upgraded to Premium and the test case file I'm watching still hasn't been indexed. Is that still a reasonable amount of time? Also, is there any way to force an individual note to get indexed?

Link to comment

I just upgraded to Premium after many years of use. I have lots of PDFs attached to notes. Now that I am premium, will the EN servers automatically start running OCR against all of the PDFs in my existing notes, or do I somehow have to trigger that process? (I'm hoping I don't have to recreate the notes!)


 


Thanks.

Link to comment
  • Level 5*

Google docs has OCR and will "create a stream of contiguous words to match whatever text it finds" in the jpg you upload. I have found that it will not do a negative (white on black) text pic though.

Hi. Welcome to the forums.

I just found out that Google Drive will not index anything after the first 10 pages of a PDF. I was a bit disappointed to find out how limited it is. I have the PDFs OCR'd, so you'd think it would be no big deal to index it all. What I plan to do as a workaround is to extract the text myself and stick it into a document.

Link to comment


I take a screen shot of some text. I put that screen shot in a note. After Evernote syncs, I can search for a word within that screen shot and Evernote finds it. Evernote has converted the picture of the words into searchable text.

Is there a way to copy some or all of that searchable text and paste it into another document as editable text?

Link to comment

Thanks to the support of Giovanna from EN Support I figured my problems out. I thought EN gives me a searchable PDF that I'd be able to select text from. jbenson2 pointed that out.

I must say that I am pretty unhappy with that and am going back to my old workflow (like jbenson2) and index my PDFs locally.

Link to comment

Just as there is no way to be "a little bit pregnant", you're either a premium subscriber, or you're not, as far as I can tell.

Yeah, guess I should rephrase my answer. There is partiality toward premium accounts vs free accounts. But no partiality toward a premium account that will not be renewed vs one that will. :)

I have a premium account that (right now) won't be renewed in January.

@BurgersNFrieds: All these criteria meet my situation, so I'll be posting an issue on your support link.

Link to comment

There is no partiality toward premium accounts. Be sure you've sync'd, since after the PDF is OCR'd, it needs to be sync'd down to your desktop. Also,

We only attempt to process an Image-based PDF if all of the following conditions are met:

The raw PDF is 25 megabytes or less.

The scan contains no more than 100 pages.

The raw PDF doesn't already contain "searchable" text that you can select and copy.

The PDF isn't encrypted or protected with a passphrase.

The PDF is not of a handwritten document.

If you believe your PDFs meet these criteria, please click the link in my signature to file a support request and we'll examine further.
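As a rough self-check before filing a request, the five conditions listed above can be written as a small Python sketch. Everything here is illustrative: the PdfInfo fields and the blocking_reasons function are invented names, the values have to be filled in by hand or by some other tool, and "25 megabytes" is assumed to mean 25 * 1024 * 1024 bytes. It is not how Evernote's servers decide.

```python
from dataclasses import dataclass

# Limits quoted in this post; the byte interpretation is an assumption.
MAX_BYTES = 25 * 1024 * 1024
MAX_PAGES = 100


@dataclass
class PdfInfo:
    """Facts about a PDF, supplied by the user (hypothetical structure)."""
    size_bytes: int
    pages: int
    has_text_layer: bool  # already contains selectable, "searchable" text
    encrypted: bool       # encrypted or passphrase-protected
    handwritten: bool


def blocking_reasons(pdf: PdfInfo) -> list[str]:
    """List every quoted condition the PDF fails; empty means none fail."""
    reasons = []
    if pdf.size_bytes > MAX_BYTES:
        reasons.append("larger than 25 MB")
    if pdf.pages > MAX_PAGES:
        reasons.append("more than 100 pages")
    if pdf.has_text_layer:
        reasons.append("already contains searchable text")
    if pdf.encrypted:
        reasons.append("encrypted or passphrase-protected")
    if pdf.handwritten:
        reasons.append("handwritten document")
    return reasons


scan = PdfInfo(size_bytes=30 * 1024 * 1024, pages=12,
               has_text_layer=False, encrypted=False, handwritten=False)
print(blocking_reasons(scan))
```

An empty list only means none of the quoted conditions rules the file out; it says nothing about how long the processing queue is.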

Link to comment

#1 - yes

#2 - a different, searchable PDF is created that can be saved to your computer, should you want/need it. (Right-click the PDF in Evernote and save the searchable version.)

#1

It's not "magical". It is based on reality. Premium users get priority over the free users. And yes, it does need to be sync'd back.

Evernote: "Once a file has been recognized, you'll need to sync the results back to your account to see them in your local client."

#2

There is an easy way to prove my point. Pull out a PDF that Evernote performed the OCR on. Try searching the PDF outside of Evernote.

Evernote: "Once the PDF has been deemed valid for processing, the PDF is run through our best-of-breed OCR engine which generates a searchable form of the same PDF. This version is synced back down to the user's desktop and mobile client applications."

First of all: thank you guys for the quick answer.

To #1: I just wonder how long it takes, because I dropped a PDF into EN this morning and it still has not been OCR'd.

I have a wild theory: Since I am not a friend of premium subscriptions, I canceled the premium account effective at the end of this month, BUT right now I am a paying premium member. Could it be that accounts that are not "unlimited" premium members are treated that way?

To #2: Since I don't even get my PDFs/notes inside EN indexed, I cannot check that right now.

So still my documents are not being indexed. Any hints what I can do?

Sincerely,

Gregory

Link to comment

Archived

This topic is now archived and is closed to further replies.
