
1st EN question regarding EN's OCR


radial9


Did some searching here and did not find an answer to this question.

My understanding is that the PDF OCR EN uses to store docs on their servers is proprietary. If in the future I download a doc from my notes back onto my PC, the EN OCR is stripped, making the PDF unsearchable. Question: if I first run the doc through ScanSnap Organizer, which embeds OCR text in the PDF, does that OCR “stay” with the doc through its journey to the EN servers AND back down in the future, or do the EN servers remove the ScanSnap ABBYY OCR and replace it with theirs?

Thanks for any insight here.

Link to comment
  • Level 5*

Hi. In my experience it's generally better to OCR your own documents, for a number of reasons including that one. If you OCR a document (i.e. create a searchable PDF file), it will remain searchable whether or not it is attached to a note.

You also know it has been done:

  • the server-based OCR can take a little while, and by definition has to be done on the server, so your search capability is delayed by at least two syncs, maybe more if the OCR server is busy.
  • there are some restrictions* on what can be OCR'd (file size, number of pages, searchable content); it may not make the cut.

* the list:

  • File size of 50MB or less
  • 100 pages or fewer
  • No “searchable” text (characters that can be selected and copied)
  • No encryption
  • Not a handwritten document

Image OCR: max size 3000 x 2400px at 300 DPI.
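
If you want to pre-check a file against that list before relying on the server OCR, something like the rough Python sketch below would do it. The limits are only the figures quoted in this post and may have changed. `PdfFacts` and `ocr_blockers` are made-up names for illustration, and reading "50MB" as 50 MiB is an assumption.

```python
from dataclasses import dataclass

# Limits as quoted in the post above; these are forum figures, not an official spec.
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50MB" is read as 50 MiB
MAX_PAGES = 100


@dataclass
class PdfFacts:
    """Hypothetical record of facts you have gathered about one PDF."""
    size_bytes: int
    page_count: int
    has_text_layer: bool
    encrypted: bool
    handwritten: bool


def ocr_blockers(pdf: PdfFacts) -> list[str]:
    """Return the reasons, per the list above, why server OCR might skip this PDF."""
    reasons = []
    if pdf.size_bytes > MAX_BYTES:
        reasons.append("file larger than 50MB")
    if pdf.page_count > MAX_PAGES:
        reasons.append("more than 100 pages")
    if pdf.has_text_layer:
        reasons.append("already contains searchable text")
    if pdf.encrypted:
        reasons.append("encrypted")
    if pdf.handwritten:
        reasons.append("handwritten document")
    return reasons
```

An empty list means none of the quoted restrictions apply. It does not guarantee the server will actually index the file.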

Hope that helps...  :)

Link to comment
  • Level 5*
13 hours ago, radial9 said:

Did some searching here and did not find an answer to this question.

My understanding is that the PDF OCR EN uses to store docs on their servers is proprietary. If in the future I download a doc from my notes back onto my PC, the EN OCR is stripped, making the PDF unsearchable. Question: if I first run the doc through ScanSnap Organizer, which embeds OCR text in the PDF, does that OCR “stay” with the doc through its journey to the EN servers AND back down in the future, or do the EN servers remove the ScanSnap ABBYY OCR and replace it with theirs?

Yes. If you OCR your documents, the OCR text does not get removed when you file them in Evernote.

The Evernote process is documented here https://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/
and includes images, pdfs and handwriting

This is used for search purposes, and is not suitable for converting your documents to text

I would turn on your scanner's OCR function. I don't have that option, and I'm not missing it; however, if/when I stop using Evernote, I'll have to run a mass OCR pass over my documents.

 

Link to comment
  • 1 month later...

Okay, I am having problems already with the Evernote OCR. It is refusing to index some PDF files that contain reports as images, and I don't think there is any clear way to investigate why. When I select the Note with the PDF I want indexed and go to the Windows app function Note | Word and Resource Counts, Evernote says I have one PDF document and zero PDF documents with Evernote OCR. I see no action to manually force the document through OCR. Very frustrating. Even if the feature does work, at the very least the error checking and feedback to the user is terrible.

I am ready to go with @gazumped's recommendation to use external OCR. What I really want is an application that will preserve the existing images in the PDF file, then create an index of the words it finds in those images and store that invisibly in the background. I have one PDF application, PDF Architect 5, that creates the OCR output as a new PDF file, and it mangles enough of the appearance that I won't accept that as the solution.

My documents are technical and have lots of long technical acronyms in them. I need absolutely first-rate OCR quality, but it is also a competitive market with lots of OCR tools, and I would prefer to stay below $100. What are some good options? I am using Windows only, but if it matters, my correspondents often use MacOS. So whatever OCR format is used, it should be something that works with any ordinary Acrobat reader. And it goes without saying that the OCR data needs to be correctly readable by Evernote, because in the big picture I want the search terms in each PDF to get into the Evernote index of the Notebook.

Link to comment
  • Level 5*
13 hours ago, persistentone said:

Okay, I am having problems already with the Evernote OCR.  It is refusing to index some PDF files that contain reports as images

I'm wondering about your description of "reports as images"
Do you really have report image files inside a pdf; that's unusual
Can you post a sample of a report so we can test it ourselves?

The OCR/Index process is documented here http://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/

Here's an example of an inexpensive external OCR app you can try out
http://solutions.weblite.ca/pdfocrx/
It has a free trial edition you can try out

Link to comment
13 hours ago, DTLow said:

I'm wondering about your description of "reports as images"
Do you really have report image files inside a pdf; that's unusual
Can you post a sample of a report so we can test it ourselves?

The OCR/Index process is documented here http://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/

Here's an example of an inexpensive external OCR app you can try out
http://solutions.weblite.ca/pdfocrx/
It has a free trial edition you can try out

 

I am scanning paper documents from a Brother printer/scanner, and the Brother software creates a PDF file from those scans.  There is no option in that software to use OCR when creating the PDF.   In the PDFs that are created from the scanner, you cannot select text.  You cannot search text.   What else can this be but a scanned image of a printed page, inserted into the PDF file?   What is the better description for me to use for this case?

How can I investigate the content of a PDF file, to learn what image format - for example - is being used for each stored page, when text and font and page description information is not available?
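
One rough way to peek inside a PDF without extra tools is to scan its raw bytes for the standard PDF stream filter names (`DCTDecode` for JPEG, `CCITTFaxDecode` for fax-style bilevel scans, and so on) and for font objects. The Python sketch below does that. It is a heuristic only: `summarize_pdf_bytes` is a hypothetical helper name, and PDFs that store their object dictionaries in compressed object streams will hide these markers, so a proper PDF library would be needed for a reliable answer.

```python
import re
from collections import Counter

# Standard PDF stream filter names and the encodings they indicate.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"CCITTFaxDecode": "CCITT fax (bilevel)",
    b"JBIG2Decode": "JBIG2 (bilevel)",
    b"FlateDecode": "Flate/zlib (may be image or other data)",
}


def summarize_pdf_bytes(data: bytes) -> dict:
    """Crude scan of raw PDF bytes for image objects, fonts and filters."""
    filters = Counter()
    for name, label in IMAGE_FILTERS.items():
        count = len(re.findall(rb"/" + name + rb"\b", data))
        if count:
            filters[label] = count
    return {
        # Image XObjects: one per embedded raster image (often one per scanned page).
        "image_objects": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        # Font objects: an image-only scan usually has none; an OCR text layer needs some.
        "font_objects": len(re.findall(rb"/Type\s*/Font\b", data)),
        "filters": dict(filters),
    }
```

To try it on a real file, read the file's bytes (for example with `pathlib.Path("scan.pdf").read_bytes()`, where the file name is a placeholder) and pass them in. Several image objects with zero font objects would be consistent with a scan that has no text layer.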

I will try out the tool you are recommending.   That looks like the kind of function I want.   I wonder how accurate the OCR would be from a product produced by such a small company.

I am willing to create a test Notebook for you and publish it to you.   Is there a specific Evernote name I should use when sharing to you?

Link to comment
  • Level 5*
1 hour ago, persistentone said:

I am scanning paper documents from a Brother printer/scanner, and the Brother software creates a PDF file from those scans.  There is no option in that software to use OCR when creating the PDF.   In the PDFs that are created from the scanner, you cannot select text.  You cannot search text.   What else can this be but a scanned image of a printed page, inserted into the PDF file?

No problem.  My scanner produces the same image PDFs. 
I was just concerned there were separate embedded image files.

No need to share a Notebook.  
My process is to share an individual note by generating a public link and posting the url, for example https://www.evernote.com/shard/s10/sh/5c8c7251-4228-4457-ba44-44b3da2e6792/83c560e1aa802b91ce5d0a470404c44c

You can also attach a file to your post in this forum, for example: Test PDF.pdf

Link to comment
  • Level 5*
13 hours ago, DTLow said:

Here's an example of an inexpensive external OCR app you can try out
http://solutions.weblite.ca/pdfocrx/
It has a free trial edition you can try out

 

37 minutes ago, persistentone said:

I will try out the tool you are recommending.   That looks like the kind of function I want.   I wonder how accurate the OCR would be from a product produced by such a small company.

I was satisfied with my initial review of this software, (ymmv)
but I didn't do extensive testing

Currently, I'm happy with the EN OCR/Search feature
Someday, I may have to address this and do a mass batch OCR
There's also the issue of handwriting conversion

Link to comment
5 hours ago, DTLow said:

 

I was satisfied with my initial review of this software, (ymmv)
but I didn't do extensive testing

Currently, I'm happy with the EN OCR/Search feature
Someday, I may have to address this and do a mass batch OCR
There's also the issue of handwriting conversion

 

The weblite PDF/OCR on Windows is triggering Microsoft's SmartScreen filter. SmartScreen refuses to let it run without an admin password, and naturally Microsoft doesn't give me any information at all about its reasoning. Any idea what is going on there?

Link to comment
  • Level 5*

I routinely batch OCR my PDFs. I scan to a folder on my hard drive, edit / rename files as necessary, then bulk OCR everything and drop the files into an Import Folder which sucks them into Evernote. I prefer this process because I know the PDFs are scanned and searchable as soon as they're uploaded, I can deal with any errors when they occur, and I can process files that fall outside Evernote's comfort zone for OCR, including ones bound for local notebooks.
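
For anyone who wants to script that scan, OCR, import-folder routine, here is a minimal Python sketch. It assumes you supply the OCR step yourself: `ocr(src, dst)` is a hypothetical callable standing in for whatever OCR tool you use, and `batch_ocr` and the folder arguments are illustrative names, not part of any product.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def batch_ocr(scan_dir: Path, import_dir: Path,
              ocr: Callable[[Path, Path], None]) -> list[Path]:
    """OCR every PDF in scan_dir and move the results into import_dir.

    `ocr(src, dst)` is a hypothetical callable that writes a searchable
    copy of `src` to `dst`; wire it to whatever OCR tool you use.
    The original scans are left untouched.
    """
    import_dir.mkdir(parents=True, exist_ok=True)
    delivered = []
    with tempfile.TemporaryDirectory() as work:
        for src in sorted(scan_dir.glob("*.pdf")):
            tmp = Path(work) / src.name
            try:
                ocr(src, tmp)
            except Exception as exc:
                # Report the failure and carry on with the next file.
                print(f"OCR failed for {src.name}: {exc}")
                continue
            final = import_dir / src.name
            # Move only finished files into the watched folder.
            shutil.move(str(tmp), str(final))
            delivered.append(final)
    return delivered
```

The OCR output is written to a temporary folder first so that the watched import folder only ever receives finished files. Failures are reported and skipped, which matches dealing with errors as they occur.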

My Fujitsu scanner box included both ABBYY and Adobe - the scanner using ABBYY for its automatic OCR.  My software of choice is Adobe however - converting the graphics of scanned pages into characters gives a substantial file size saving,  which can be important when using Basic accounts.  Adobe seems to perform best for me with around a 40% size reduction.  

The choice of PDF software is pretty wide - if used manually,  it's an open choice - see what's available to fit your budget:  http://www.toptenreviews.com/business/software/best-ocr-software/

Link to comment
  • Level 5*
5 hours ago, persistentone said:

The weblite PDF/OCR on Windows is triggering Microsoft's SmartScreen filter. SmartScreen refuses to let it run without an admin password, and naturally Microsoft doesn't give me any information at all about its reasoning. Any idea what is going on there?

Sorry, no info

I'm on a Mac, and don't even use the software much.  I wouldn't waste time on it - check out the alternatives in @gazumped's post

Link to comment
  • Level 5

I use ABBYY FineReader, which does a great job of converting image-PDFs into text (savable as PDF or RTF) in multiple languages. It's the automatic indexing, though, that @persistentone desires that seems to me to be the sticking point. Is there software that does automatic indexing? Sounds very handy if there is!

Link to comment
  • Level 5*
20 minutes ago, Dave-in-Decatur said:

Is there software that does automatic indexing?

Hmmn. I read that as just providing a search index of the content of a PDF. Another post somewhere recently asked about more complicated stuff like preparing an index of source material, and I believe there's a whole range of academic software that does a far more comprehensive analysis of metadata and text than 'just' OCR. I've used something called Docear if you want to take a look: http://www.sciplore.org/projects/docear/

ABBYY and Adobe of course add another layer to the PDF file after OCR which consists of the actual text content,  so the searchable file can be downloaded and shared as necessary.  The Evernote server-based OCR just keeps a local index in the note which is not downloaded with the main document.  It used to be possible to download that as well as,  or instead of the original PDF.  Don't know if that's still possible.

Link to comment
  • Level 5*
3 hours ago, Dave-in-Decatur said:

It's the automatic indexing, though, that @persistentone desires that seems to me to be the sticking point. Is there software that does automatic indexing? Sounds very handy if there is!

I like the indexing in Evernote, but if your pdfs are ocr'd  I don't think the indexing software is a concern.
I could leave the pdf file in my Mac OS and use the Finder search feature

Link to comment
  • Level 5*
12 hours ago, persistentone said:

As long as we are on the topic of doing OCR on a PDF using a tool external to Evernote: is there any reader that will let me examine the full text associated with the image so that I can edit out the mistakes?

I used a version of ABBYY a while ago that would present an OCR'd document with highlights on any unrecognized words so the output could be edited where necessary.  Don't know if that feature still exists - I think you'd have to check out the top few apps on your prospective purchase list to confirm what degree of 'training' (for the recognition software) is available.  It should be possible to add generic medical or scientific terms as additional libraries too.

Link to comment
10 hours ago, gazumped said:

I used a version of ABBYY a while ago that would present an OCR'd document with highlights on any unrecognized words so the output could be edited where necessary.  Don't know if that feature still exists - I think you'd have to check out the top few apps on your prospective purchase list to confirm what degree of 'training' (for the recognition software) is available.  It should be possible to add generic medical or scientific terms as additional libraries too.

 

That's an interesting feature, but it is different from what I am suggesting. I want to edit the entire full-text document, just so I can correct mistakes made by the software for terms it thinks it got right but actually got wrong. I understand your point that I need to look around. I am hoping someone has already identified the software and will respond on this thread some day in the future.

Link to comment
  • Level 5
10 hours ago, persistentone said:

That's an interesting feature, but it is different from what I am suggesting. I want to edit the entire full-text document, just so I can correct mistakes made by the software for terms it thinks it got right but actually got wrong. I understand your point that I need to look around. I am hoping someone has already identified the software and will respond on this thread some day in the future.

I think ABBYY FineReader would do what you want. You can learn about its features and download a trial copy at https://www.abbyy.com/en-us/finereader/.

Link to comment
  • Level 5*
On 2017-03-09 at 7:55 PM, persistentone said:

I want to edit the entire full-text document, just so I can correct mistakes made by the software for terms it thinks it got right but actually got wrong.

Check out   http://help.abbyy.com/FineReader/FineReader12/English/CheckResults/StepChecking.htm

Once the OCR process is complete, the recognized text appears in the Text window. The characters recognized with low confidence will be highlighted, so that you can easily spot the OCR errors and correct them. 

You can edit recognized texts either directly in the Text window or in the Verification dialog box (click Tools > Verification… to open the dialog box). In the Verification dialog box, you can review low-confidence words, correct spelling errors, and add new words to the user dictionary.

Link to comment
19 hours ago, DTLow said:

Check out   http://help.abbyy.com/FineReader/FineReader12/English/CheckResults/StepChecking.htm

Once the OCR process is complete, the recognized text appears in the Text window. The characters recognized with low confidence will be highlighted, so that you can easily spot the OCR errors and correct them. 

You can edit recognized texts either directly in the Text window or in the Verification dialog box (click Tools > Verification… to open the dialog box). In the Verification dialog box, you can review low-confidence words, correct spelling errors, and add new words to the user dictionary.

 

That looks great. It costs about three times what a Premium account on Evernote costs, so it is not cheap. But I have bookmarked it and I may have no choice.

Link to comment

Archived

This topic is now archived and is closed to further replies.
