
Separate OCR workflow?



I am not getting the Premium version of Evernote because:

 

1. The OCR is not as accurate as that of leaders such as ABBYY, and

2. I don't need all the other add-on features.

 

As such, I'd like to ask if anyone has an established workflow (both Android and Windows) on seamlessly OCR-ing documents before archiving them in Evernote. Docs can be in image or PDF format with text.

  • Level 5*

Hi.  I am a Premium user,  and I still use third-party software (Adobe) to OCR my scans.  My scanner (a venerable Fujitsu ScanSnap S1500) will -in theory- OCR as it scans,  but I find that the scanning process takes a little longer that way.  So mine is set up to scan PDFs to a desktop folder without OCR.  When I've finished scanning,  or at the end of the day if I forgot earlier,  I'll use a feature of Adobe which will batch-OCR a group of files.  I apply that to my scanning folder,  and then move the OCR'd content to an Import folder (Windows) so it can be absorbed into Evernote as separate notes.  I find the folder OCR method better too because it

  • allows me a chance to "edit" the files -
    • my indexing system needs a date/type/focus/keywords title for each note,  so I rename the file to reflect that.
    • occasional scanning failures sometimes mean I need to add / delete / rotate pages
  • reduces the file size by up to 50% (with Premium upload and note limits that's not so important, but it keeps the size of my local database to a minimum)
  • means the file is immediately searchable rather than having to wait an indeterminate time for Evernote's servers to get around to it.  That's never very long,  but in this case I know my file is searchable now.  Plus there are some restrictions on file size and page numbers that I don't have to worry about.
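
The rename-and-move part of this routine could be scripted roughly as below. This is only a sketch: the `build_title` and `move_to_import` helpers and the exact title layout are assumptions for illustration (the post only says titles follow date/type/focus/keywords), and the OCR step itself is left to whatever tool is already in use.

```python
import shutil
from datetime import date
from pathlib import Path


def build_title(day: date, doc_type: str, focus: str, keywords: list[str]) -> str:
    """Build a 'date type focus keywords' title (the layout is an assumption)."""
    parts = [day.isoformat(), doc_type, focus, *keywords]
    return " ".join(p.strip() for p in parts if p.strip())


def move_to_import(pdf: Path, import_dir: Path, title: str) -> Path:
    """Rename an already-OCR'd PDF and move it into the import folder."""
    import_dir.mkdir(parents=True, exist_ok=True)
    target = import_dir / f"{title}.pdf"
    if target.exists():
        # Refuse to overwrite an earlier scan with the same title.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(pdf), str(target))
    return target
```

Calling `move_to_import(Path("scan001.pdf"), Path("Import"), build_title(date(2019, 11, 17), "receipt", "scanner", ["warranty"]))` would leave a file named `2019-11-17 receipt scanner warranty.pdf` in the import folder; both paths here are placeholders.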

...Other OCR systems like ABBYY are available...

As to mobiles,  I tend to use Evernote's built-in camera / scan feature.  I don't scan multi-page documents to Android very often and the image / single page OCRs that I need are very rarely critical for OCR.  I do have various third-party apps including Adobe Scan and MS Office Lens which will scan and OCR to PDF.

On the general subject of image vs PDF I'd always say that OCR'd PDFs are best unless you really really need to see the content inline;  image searches can't be as accurate as OCR PDF because image software tends to allow 'trees' of possible meanings,  so the word 'horse' could also register as 'house' 'hearse' and 'harsh'.  Images do show up immediately on mobiles though,  where PDFs are a downloadable attachment.
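
To illustrate the 'trees' point, here is a toy sketch of how keeping several candidate readings per word widens what a search will match. The data structure is an assumption made for illustration, not Evernote's actual index format.

```python
def matches(query: str, candidates_per_word: list[set[str]]) -> bool:
    """True if the query equals any candidate reading at any word position."""
    q = query.lower()
    return any(q in candidates for candidates in candidates_per_word)


# Image recognition keeps several plausible readings for one word (as in the post).
image_index = [{"horse", "house", "hearse", "harsh"}]
# An OCR'd PDF commits to a single reading per word.
pdf_index = [{"horse"}]

print(matches("house", image_index))  # True: a false hit if the page really says "horse"
print(matches("house", pdf_index))    # False
```

The image index is more forgiving of recognition errors, but the price is extra hits that have nothing to do with the search term.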

  • Level 5*
10 hours ago, Titus said:

Separate OCR workflow?

I have an inexpensive OCR app (PDF OCR X), but I only use it when required.   
I could implement a seamless workflow, but it's not important for me    
- I rely on Evernote's OCR/search feature   
- I haven't found an OCR app for handwriting (actually ICR)

Evernote's OCR/search feature is included in the Premium package, and works "seamlessly" and transparently
- the transcription is stored in a hidden text file; this is indexed by the search feature 
- includes handwriting  

>>1. The OCR is not as accurate as the leaders such as ABBYY

No idea on accuracy; it all seems to work well for me   
You might be interested in this article https://www.abbyy.com/en-ca/casestudies/evernote-accelerates-company-growth-by-enlisting-abbyy/#sthash.8r5Pyekx.dpbs

>>2. I don't need all the other add-on features.

The add-on features I need are the upload and device limits.  I cannot survive on the Basic restrictions   
I appreciate the other add-on features and make use of them

  • Level 5*
7 hours ago, gazumped said:

to a desktop folder

I see this as the first step for seamless OCR   
Possibly a cloud folder for both mobile/desktop use

>>As to mobiles,  I tend to use Evernote's built-in camera / scan feature.  I don't scan multi-page documents to Android very often

99% of my scanning is with a  scanner app on my iPad; including multi-page documents

Some scanner apps have OCR built in

>>occasional scanning failures sometimes mean I need to add / delete / rotate pages

Same here.  I do this adjustment within the scanner app

  • Level 5*
On 11/17/2019 at 2:33 AM, Titus said:

As such, I'd like to ask if anyone has an established workflow (both Android and Windows) on seamlessly OCR-ing documents before archiving them in Evernote. Docs can be in image or PDF format with text.

  1. I use a ScanSnap s1300i to scan documents to an import folder such that the document ends up in my Scans notebook.  I let the ScanSnap OCR as it scans.  The OCR is typically done by the time I am ready to scan another document.  I tag, re-title if need be, and move the note to its notebook from the Scans notebook.
  2. I use Adobe to OCR any documents that I download that aren't already OCR'd.  This includes documents that contain renderable text.  This is a bit of a PITA but the best way I have found to do it (a couple of years back anyway).  Method below, faster than it looks but still...

OCR PDF with renderable text

  1. Open the PDF document in Acrobat
  2. File > Save As
  3. Choose TIFF (*.tif, *.tiff)
  4. Save to desktop or wherever (Acrobat saves each page of the PDF document as a separate, sequentially numbered TIFF file)
  5. Open the first TIFF file with Acrobat
  6. Go to pages view
  7. Drag each subsequent page into the PDF in order
  8. Tools - Recognize Text
  9. Exit Adobe and save as to Desktop
  10. Drag into EN
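
Step 7 depends on the page files staying in numeric order, and a plain alphabetical sort puts page 10 before page 2. A minimal sketch of a natural sort follows; the file names are hypothetical, since the post only says Acrobat numbers the TIFF files sequentially.

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and integer chunks so 'Page_10' sorts after 'Page_2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical file names for illustration.
pages = ["statement_Page_10.tiff", "statement_Page_2.tiff", "statement_Page_1.tiff"]
print(sorted(pages, key=natural_key))
```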

 

  • Level 5

Documents I scan with my ix500 Scanner get the OCR directly through the ABBYY software that came with the scanner.

Documents I scan with my ScannerPro-App on my iPhone get the OCR from that app.

All other pdf and picture imports into EN get scanned by the EN server function.

Because EN auto-detects files that were not OCRed before, and goes to work on them, the manual effort in this approach is nearly zero.

From my experience the scan results are not far off, maybe (without having analyzed this deeply) it depends on the document type and quality which scanner will perform better on which pdf. What amazes me is how good the OCR is even working on the finest 5pt-print of business letter templates. All OCRs serve their purpose to make the files retrievable by full text search. And that is what I need it for.

  • Level 5*
32 minutes ago, PinkElephant said:

From my experience the scan results are not far off, maybe (without having analyzed this deeply) it depends on the document type and quality which scanner will perform better on which pdf. What amazes me is how good the OCR is even working on the finest 5pt-print of business letter templates. All OCRs serve their purpose to make the files retrievable by full text search. And that is what I need it for.

I've found that my local ScanSnap and Adobe OCR is more accurate than the EN server version.  More accurate meaning more occurrences of the search term are found locally than via the web.  I filed a ticket some years back when note counts for searches were different on desktop and web.  The  problem was recognized.  Never did get a response as to what had been done.

  • Level 5

Interesting observation. When I had a difference, I reindexed my search index locally, and got identical results from both sources after that.

From what I learned, EN will not OCR a file that already comes with a certain word count; such files are regarded as not needing OCR. It makes no difference whether the text comes from creating the PDF directly (like a Word "Save As" to PDF) or from an external OCR. As far as I know, when a PDF contains a relatively significant amount of text, EN saves it as it is, without OCR.

If this is true, search on the server and locally should work on the same data, and should return the same results.

This would not exclude that when the same pdf is once OCRed outside of EN, and once by the EN server, there may be a gap between search results. But this would probably depend on the OCR quality itself, because these 2 attachments would be treated differently, only the unOCRed being OCRed by EN on the server after upload.
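
The rule described above (skip OCR when the file already carries enough text) might look something like this sketch. The threshold value and the `needs_ocr` helper are hypothetical; the real cut-off is not stated anywhere in this thread, and the function assumes the embedded text has already been extracted from the PDF by some other tool.

```python
MIN_WORDS = 20  # hypothetical threshold; the real cut-off is not known from this thread


def needs_ocr(embedded_text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when a PDF's embedded text layer is too thin to count as OCR'd."""
    return len(embedded_text.split()) < min_words


print(needs_ocr(""))                     # True: image-only scan, no text layer
print(needs_ocr("invoice total " * 30))  # False: already carries plenty of text
```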

Maybe it would be worth finding out. @CalS, you could give it a try: upload, say, the next 20 PDFs created through the Adobe process in 2 versions, one of them left for EN to OCR, and then compare search results between these docs. I assume the difference will be zero.
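
A comparison like that boils down to a set difference over the hits each client returns for the same search term. A small sketch, with made-up note titles and a hypothetical `compare_hits` helper:

```python
def compare_hits(desktop: set[str], web: set[str]) -> dict[str, set[str]]:
    """Split search hits into those found by both clients and by only one."""
    return {
        "both": desktop & web,
        "desktop_only": desktop - web,
        "web_only": web - desktop,
    }


# Hypothetical note titles returned for the same search term.
desktop_hits = {"receipt-01", "receipt-02", "receipt-03"}
web_hits = {"receipt-01", "receipt-03"}
diff = compare_hits(desktop_hits, web_hits)
print(sorted(diff["desktop_only"]))  # ['receipt-02']
```

Anything landing in `desktop_only` or `web_only` is a note worth opening to see which OCR produced its text.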

  • Level 5*

I found the above issue when I was searching for a receipt.  If memory serves Desktop came back with 300 or so results and IOS (basically web) came back with 225 or so.  So I contacted support and below is the salient reply from 2018, highlighted bit in red being key I think. 

Not really sure how the OCR process works re the EN servers.  It seems to me things are OCRd on the server but the results do not overlay what is on the desktop if the PDF was already OCRd on the desktop, nor does the desktop overlay what is on the server.  Hence the differences.  So if one syncs a PDF with a non-monospace font to the server, OCR results may or may  not match the same PDF OCRd locally.  Anyway, have not heard back as to whether this issue was ever resolved.  Though current searches would say probably not.  Moral of the story - I do important text searches on the desktop.

Anna H. (Evernote Help and Learning)
Aug 17, 14:30 PDT
 
Hi Calvin,
 
Thanks so much for the files. I was able to run this past a member of our search team, and they were able to shed some light on what seems to be happening.
 
To first elaborate on why it works on Windows when compared to iOS - Windows uses a different search mechanism to extract text from PDFs than the one used by our server, which is what web and iOS use.
 
The issue you're seeing is related to server-side search, which seems to have an issue with recognizing the wrong font when scanning PDF files. The text in the PDF is printed in a monospaced font, so the letters are wider, but the OCR sometimes incorrectly recognizes it as a non-monospace font, so it sees the gap in position and thinks there's a space there.
 
To confirm this, using the examples that you provided, searching for 'Prairie' only brings up one of the two PDFs, but searching for 'P r a i r i e' brings them both up.
 
Our team is aware of this and they are already looking into possible solutions. I apologize for the inconvenience in the meantime, I understand that this doesn't leave you with an immediate solution, but hopefully we can get this fixed soon.
 
Let me know if you have any additional questions.
 
Best,
Anna H.
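
The 'P r a i r i e' symptom in the reply above can be undone on already-extracted text by joining runs of single letters. This is an illustrative sketch only, not something Evernote is known to do, and the `collapse_spaced_letters` helper is an assumption.

```python
import re

# Three or more single letters separated by single spaces, e.g. "P r a i r i e".
SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Join runs of letter-spaced characters back into one word."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_letters("P r a i r i e Street receipt"))  # Prairie Street receipt
```

One limitation: if two letter-spaced words sit next to each other, the boundary between them is lost and they are merged into one.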
  • Level 5

Interesting reading. If I read it correctly, the server OCR has / had a problem with monospaced fonts, which is practically the former typewriter look - whenever a letter was printed on the paper, the machine moved forward by the same step to print the next one. I think on Windows the Courier font was the most common of its type.

It seems the OCR used non monospace, which introduced a series of imaginary spaces between the letters.

It should be possible to "heal" this; most OCR programs have no problem with either font family. In fact, once the font is identified as monospace it should be quite easy to OCR, because letters never overlap.
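
A toy model of the effect described above: if spaces are inferred from the gap between glyphs, a threshold tuned for proportional text splits a monospaced word into single letters, while a threshold scaled to the fixed pitch does not. The ink widths, the pitch and both helper functions are made up for illustration; this is not how any particular OCR engine is implemented.

```python
def segment(glyphs: list[tuple[str, float, float]], space_gap: float) -> str:
    """Join (char, x_start, ink_width) glyphs, adding a space when the gap is wide."""
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if prev_end is not None and x - prev_end > space_gap:
            out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


def monospace_glyphs(word: str, pitch: float, ink: dict[str, float]) -> list[tuple[str, float, float]]:
    """Lay a word out on a fixed pitch, centring each glyph's ink in its cell."""
    return [(c, i * pitch + (pitch - ink[c]) / 2, ink[c]) for i, c in enumerate(word)]


# Made-up ink widths (in points) for a typewriter-style face with a 6pt pitch.
INK = {"P": 4.0, "r": 3.0, "a": 3.5, "i": 1.0, "e": 3.5}
glyphs = monospace_glyphs("Prairie", pitch=6.0, ink=INK)

print(segment(glyphs, space_gap=1.5))  # P r a i r i e  (threshold tuned for proportional text)
print(segment(glyphs, space_gap=6.0))  # Prairie        (threshold scaled to the fixed pitch)
```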

I remember another issue: When OCRed with the pdf, the OCR result is stored right in the pdf. When EN does it on the server, the OCR result is stored in a hidden section somewhere in the note, not in the pdf itself. At least I have never noticed a pdf changed by EN, so probably this is true.

When I have time and my different clients up and running, I think I will repeat my tests on search results. Last time after rebuilding the index I came out with no significant difference.

  • Level 5*
Post back what you find if you wouldn't mind.  Interesting tidbit.

  1. I scan and OCR something into EN.
  2. For a text search the desktop finds the PDF, IOS/Web does not.
  3. I OCR the PDF from within EN on the desktop (open in Adobe, recognize text, exit and save back to EN).
  4. The text search now works in IOS/Web.

It's like doing the OCR within the desktop app sends the OCR results to the server whereas the initial note create does not.  Both "versions" of the PDF work on the desktop; only the second version works on IOS/Web.  Weird.

  • Level 5*

One of the re-OCRs was from July.

Also noted in the process that a note containing an Excel document appeared in the Desktop results though not in IOS/web results.  This is problematic in that Office docs are supposed to be included in searches, a second issue in any case.
