lmason

windows Evernote PDF OCR vs ScanSnap OCR


I let my ScanSnap do OCR on the pages I scan. According to the Evernote Knowledge Base article quoted below, these PDFs are then not OCR'd by Evernote. Evernote says it has a best-of-breed OCR engine.

Is Evernote's OCR better than what ScanSnap can do?

How should I, as a premium user, do OCR in the future? Let ScanSnap do it or let EN do it?

Thanks.

___________________________--

How does Evernote's PDF processing work for scanned PDFs?

Our PDF processing system is built to allow people to use a scanner to take a scanned document and make it searchable within Evernote, even if the scanner doesn't perform any type of optical character recognition (OCR) processing when the document is scanned. To make sure that this process is fast and efficient, we have instituted a set of boundaries in the system that try to approximate what a "regular user" would do with a scanner (which we'll describe in a minute).

When a note is synced to Evernote that contains a file with the "application/pdf" MIME type (more info on MIME types here), it is placed in the queue for processing. Note that the file doesn't necessarily need to have the .pdf extension. Once the PDF gets to the front of the processing queue, the processor analyzes the file to ensure it qualifies to be processed/recognized. The processor will reject the PDF if any of the following conditions are met:

The PDF contains more than 100 pages.

The PDF file is larger than 25MB.

The PDF does not contain at least one "scanned" page. A "scanned" page is defined as a page that contains at least 1025 pixels of image data and no more than 512 characters of regular, searchable text (this is enough for a text-based fax header or similar). PDF files that have already been processed by a separate OCR system will not satisfy this condition and will be rejected.

The PDF contains more than one non-scanned page. (That is, the document may have one "cover" page without any image data, but if there's more than one, then it's not a real scan and we reject it.)

The analysis crashes or fails for some technical reason, typically due to a malformed PDF from some crazy source, or because the PDF is password protected (encrypted).

The analysis process takes more than 30 seconds to complete.

Once the PDF has been deemed valid for processing, the PDF is run through our best-of-breed OCR engine which generates a searchable form of the same PDF. This version is synced back down to the user's desktop and mobile client applications.
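
The rejection rules quoted above can be read as a simple checklist. The following is only a sketch of that reading, not Evernote's actual code: the `Page` type, the function names and the per-page counts are all hypothetical, "25MB" is assumed to mean 25 × 1024 × 1024 bytes, and a "non-scanned" page is assumed to be any page that fails the "scanned" test.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits as quoted in the Knowledge Base text above.
MAX_PAGES = 100
MAX_BYTES = 25 * 1024 * 1024   # assumption: binary megabytes
MIN_IMAGE_PIXELS = 1025
MAX_TEXT_CHARS = 512
MAX_NON_SCANNED_PAGES = 1
MAX_ANALYSIS_SECONDS = 30


@dataclass
class Page:
    """Hypothetical per-page summary produced by some PDF analyzer."""
    image_pixels: int   # pixels of image data on the page
    text_chars: int     # characters of regular, searchable text


def is_scanned(page: Page) -> bool:
    """A page counts as scanned if it is mostly image and has little text."""
    return (page.image_pixels >= MIN_IMAGE_PIXELS
            and page.text_chars <= MAX_TEXT_CHARS)


def rejection_reason(pages: List[Page], size_bytes: int,
                     encrypted: bool = False,
                     analysis_seconds: float = 0.0) -> Optional[str]:
    """Return why the PDF would be skipped, or None if it would be OCR'd."""
    if len(pages) > MAX_PAGES:
        return "more than 100 pages"
    if size_bytes > MAX_BYTES:
        return "larger than 25MB"
    if encrypted:
        return "analysis failed or PDF is encrypted"
    if analysis_seconds > MAX_ANALYSIS_SECONDS:
        return "analysis took more than 30 seconds"
    scanned = sum(1 for page in pages if is_scanned(page))
    if scanned == 0:
        return "no scanned page"
    if len(pages) - scanned > MAX_NON_SCANNED_PAGES:
        return "more than one non-scanned page"
    return None


# A three-page scan that ScanSnap has already OCR'd: every page carries
# far more than 512 characters of text, so no page counts as "scanned".
already_ocrd = [Page(image_pixels=2_000_000, text_chars=3000)] * 3
print(rejection_reason(already_ocrd, size_bytes=1_000_000))
```

Under this reading, a PDF that already has a full text layer is skipped for the "no scanned page" reason, which matches the Knowledge Base statement that files processed by a separate OCR system are rejected.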


It takes a bit longer, but I always let ScanSnap do the OCR for me.

Why?

1.) Exported PDFs:
ScanSnap: The PDF document remains OCR'd if I export it from Evernote.
Evernote: The PDF document loses its OCR if I export it from Evernote.

 

2.) Consistency:
ScanSnap: The search results are consistent in Evernote, whether I view them from my desktop client or the Evernote web.
Evernote: The search results are not consistent because Evernote uses different OCR software depending on the platform.

 

3.) 100% OCR:
ScanSnap: Works on notes that are stored in my local, non-synced Evernote notebooks.
Evernote: Evernote cannot see the notes in my local, non-synced notebooks, so those PDFs cannot be OCR'd.

 

4.) No complex rules:
ScanSnap: OCRs all my PDFs. There are no rules, and I know it is done.
Evernote: Evernote has 5 technical rules to follow and no warning if the document fails to meet all the rules.

OK, now that I've scanned 50 or so documents into EN, how will I know whether they've been OCR'd? I'll scan all my new ones with OCR turned on in ScanSnap, but what do I do with the ones already in Evernote? They seem to meet all the criteria.


Also, now that I'm doing more OCR within ScanSnap, how come the searched text isn't highlighted anymore? Is highlighting only for searched JPEGs? Grr!


And here's another interesting question. I have a ScanSnap 1500 (which comes bundled with Adobe Acrobat) that OCRs a document for me while (OK, after) scanning it.

I recently had to add a new scanned page or two to a PDF file by merging files (different formats), so I chose the Adobe "OCR document" option to make sure the new file had contiguous OCR. I had saved the merged file as a new file, and it came out as 1.2MB. After the new OCR I re-saved the file, which was now 900K. I've tried this on a few large files since. In my recent experience, the size of a ScanSnap OCR file is always reduced by 10% or more if the Adobe OCR option is used after scanning. Both files seem equally searchable, but the extra OCR definitely squishes the size a little more. That's useful if you're ever pushing the 50MB note maximum, and always useful for making the most of your monthly upload limit.
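
For anyone who wants to check their own before-and-after sizes, the saving can be worked out as below. This is just an arithmetic sketch; it assumes 1.2MB means 1200 KB, and the function name is made up for illustration.

```python
def percent_reduction(before_kb: float, after_kb: float) -> float:
    """Size saved by re-processing a file, as a percentage of the original."""
    return (before_kb - after_kb) / before_kb * 100


# The merged file from the post above: 1.2MB before, 900K after.
print(percent_reduction(1200, 900))  # 25.0
```

So that particular file shrank by a quarter, comfortably above the "10% or more" seen on other files.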


Sorry to 'bump' this, but I'm curious to know, for both the ScanSnap 1300 and 1500 scanners, how much longer a single-page scan takes, say, if the scanner does the OCR. I'm about to buy one of these.

Thanks


Presumably, with all of its super hardware, Evernote is going to be faster at processing documents. But the devil is in the details: they don't tell you when they will get around to doing your document. With ScanSnap, you do it right then and there. Depending on your document, it will go through it at a couple of seconds or so per page, so it will finish quite quickly. I always OCR everything before putting it into Evernote.


I think I want to try something different. I don't want to wait anymore for all of my pages to convert.

So I thought I would buy ABBYY FineReader and set it to convert all non-searchable documents in a folder (the EN folder).

Right now the version I have (FineReader for Fujitsu) will watch a folder, but it only processes documents scanned with the scanner.

Do you think this process would work? After the software updates the PDF, will it still sync to EN?


If you have other PDFs (not created with ScanSnap), you can also do bulk OCR with Acrobat 9. It doesn't have a "watch folder" feature, but you can open an "OCR multiple" window and drag all your files there.


Another problem I have seen is that ScanSnap's OCR often leaves spaces between characters. If someone's name is John Doe and I select the text and copy it, it comes out like "J o hn Do e", with random spaces inserted. This breaks Evernote's ability to highlight searched-for terms: if I search for John it may pull up the note, but it won't highlight "John" unless I search for "j o hn". Very frustrating. What's more frustrating is that having Evernote OCR the files isn't an option either, because I don't want to lose the OCR if I pull the PDF out of Evernote.

Arg...

Evernote, can you please leave the OCR on files, for premium users at least, so that I can ditch the shoddy pre-Evernote OCR steps?
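
This won't change what Evernote highlights, but for text copied out of such a PDF, the stray spaces can be tolerated at search time. The sketch below is one possible workaround, assuming the only damage is extra whitespace between characters; the function names are made up for illustration.

```python
import re
from typing import Optional


def spaced_pattern(term: str) -> "re.Pattern":
    """Build a regex that matches the term even with whitespace between letters."""
    chars = [re.escape(c) for c in term if not c.isspace()]
    if not chars:
        raise ValueError("search term is empty")
    return re.compile(r"\s*".join(chars), re.IGNORECASE)


def find_term(term: str, text: str) -> Optional[str]:
    """Return the matched (possibly space-riddled) text, or None."""
    match = spaced_pattern(term).search(text)
    return match.group(0) if match else None


print(find_term("John", "Invoice for J o hn Do e"))      # J o hn
print(find_term("John Doe", "Invoice for J o hn Do e"))  # J o hn Do e
```

The trade-off is that the pattern also matches across real word boundaries (for example "banjo hnefatafl" would match "John"), so it suits name lookups better than general search.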


Hmmmm, I find the issue is more with EN... interesting... Have you tried increasing the resolution?


Ahoy! If you are not satisfied with ScanSnap's bilge rat OCR, simply turn it off before scanning. Shiver me timbers. Evernote will do the OCR. Well, blow me down!


I'm now aware that ScanSnap uses ABBYY OCR software for its native processing, even though Acrobat 9.0 is bundled with the machine. OCR with ABBYY generates moderate-sized files (i.e. I've never had a problem with the sizes of multi-page documents), but OCR with Acrobat 9.0 in 'batch' mode gives you significantly smaller files, say 10-20% smaller, which is going to make a difference to your monthly upload allowance should you be pushing any boundaries here.

For that reason (and the odd minute or so per scan that you save), I'd suggest scanning to a folder initially without OCR, then batch OCRing via Adobe, then uploading to Evernote. You get the most efficient, portable and smallest OCR results that way.


Now, if only we could automate that process :)


I keep OCR on all the time on my Fujitsu ScanSnap. However, if I'm scanning something that really doesn't need OCRing and I'm too lazy to wait a few seconds while it goes through the process, I just hit "cancel"; it stops immediately and goes into File Save mode. I probably do this for half the stuff that I scan. I agree, by the way, with the previous comment that when you OCR a file it comes out somewhat smaller (at least 10%, or even more in some cases).


I know this is a fairly old topic, but I thought I should resurrect it because no one has mentioned OCR of handwriting. I recently scanned some handwritten notes, converted them directly to PDF without any sort of OCR processing, then imported them directly into EN. I've waited a couple of weeks (as I only have a free account), and text searches still return no results from those PDFs.

Is that normal?

Should I be importing handwritten notes as JPEGs instead, as I've found that works in searches?

I don't know of any OCR software that adds text data to PDFs of handwritten notes. Does that even exist?

Thanks in advance,

Phil


Ok, I just realized that it's a premium feature, so I guess even any scans of printed text converted to PDFs will not be OCR parsed by EN. I guess it's time I start thinking about going premium. Thing is I don't even use my 50 megs a month.... Yet!


Errm.. I'm pretty certain limits aren't meant to be targets, as such. There are lots of other reasons to go Premium including higher limits, priority processing, shared notebooks and (fanfare) a bit of monetary support for the company that created a mindblowingly good product that lots of people (+11 here) couldn't live without. (Not that I'm likely to do anything really rash if my access to technology and the old external brain were cut off - I'd just cry. A lot.)

Seriously. The subscription is peanuts compared to the benefits (peanuts - elephants.. please yourselves) and it all helps towards more and better development - like getting that darn due date delivered sometime this century... <kidding guys, honest>

:P


Yeah, I understand limits aren't supposed to be targets; it's just that I'm not getting anywhere near the 50 meg quota. I think once I start paying I'll envisage lots of other ways to use EN. I'll go ahead and sign up right away. :)


Another possibility is to hit "cancel" immediately when ScanSnap tries to OCR and to let Acrobat do it instead. One of the advantages of this method is that Acrobat X has PDF optimization and OCR on the same screen, i.e. you can select "OCR" and optimize the PDF at the same time. I have found that optimizing the PDF is a tremendous way to reduce file size. The last step that I do is to save the file as a "reduced file pdf". The combination of OCRing, optimizing and saving as a reduced file will very often reduce the file size by 75% or more. I just did this on a file that was originally 7MB and is now 1.6MB.


I'm glad I found this thread and found I'm not the only one with this dilemma. So is there a consensus on which OCR is better: EN's "best of breed" or ScanSnap's OCR that is performed after scanning?

I've got a couple more questions on things I want to clarify:

1. Is ScanSnap's OCR performed by the ABBYY FineReader engine?

2. If a PDF that has previously been OCR'ed, and so has the text information stored in it, is uploaded to EN (by a premium subscriber), will EN perform any more OCR with its "best of breed" OCR technology, or ignore it since the PDF already contains text information?


Something else I would also like to add is my annoyance at how EN and Fujitsu advertise how great their products work together. Personally I see a whole range of issues:

1. Using the ScanSnap Manager you are able to scan documents straight into Evernote as PDF files. However, the only way to have OCR performed on the PDFs before they enter EN is to select the option in the ScanSnap Manager to perform OCR during the scan. This means you have to wait 30 seconds or so after you scan each document while OCR is performed by ScanSnap before you can scan the next document. Completely unworkable.

2. You can deselect the option in ScanSnap Organiser to have OCR performed during the scan and still scan straight to EN, but then you have a PDF placed in Evernote with no OCR and you have to wait until EN's "best of breed" OCR gets around to processing your file. Also, once the OCR is performed by EN on the PDF you are all well and good to search for text within the PDF in EN, but once you remove the PDF from EN there is no embedded OCR data and you are unable to select any text to copy and paste from the PDF, etc.!

3. The other option I thought of was to use an EN watch folder to watch the folder my ScanSnap scans go into. However, this is also not without problems. The way ScanSnap works is that the document is scanned, and a PDF is created without OCR and placed in a specified folder. Shortly thereafter ScanSnap Organiser runs OCR on these PDFs. The PDFs remain in the same location with the same file names. So specifying this folder as an EN watch folder will again result in EN uploading the pre-OCR'ed version.

The only way I can see these two products working together at the moment is by the following laborious process:

1. Scan documents into the ScanSnap Organiser

2. Wait for OCR to be performed

3. Then send the files into Evernote

4. Move/delete original files
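
Steps 2 to 4 could in principle be scripted: only move a PDF into the Evernote import folder once it appears to contain a text layer. The sketch below relies on a rough heuristic of my own, not anything ScanSnap or Evernote documents: a PDF with embedded text normally references a `/Font` resource, either in the raw file or inside a Flate-compressed stream. The folder arguments and function names are hypothetical, and the check can be fooled by unusual PDFs.

```python
import re
import shutil
import zlib
from pathlib import Path
from typing import List

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: look for a /Font resource, plain or inside Flate streams."""
    if b"/Font" in pdf_bytes:
        return True
    for match in _STREAM.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, e.g. a JPEG-encoded page image
        if b"/Font" in inflated:
            return True
    return False


def send_ocrd_pdfs(scan_dir: Path, import_dir: Path) -> List[str]:
    """Move PDFs that already look OCR'd into the import (watch) folder."""
    import_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(scan_dir.glob("*.pdf")):
        if has_text_layer(pdf.read_bytes()):
            shutil.move(str(pdf), str(import_dir / pdf.name))
            moved.append(pdf.name)
    return moved
```

Run periodically, `send_ocrd_pdfs` would leave freshly scanned, not-yet-OCR'd files where they are and pick them up on a later pass, so the import folder only ever sees the searchable version.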

So who still thinks Fujitsu ScanSnap scanners and Evernote work great together?


I've done a bit of Google searching on this but can't find what I'm looking for. Does anyone know of a program that can watch a folder for PDFs, perform OCR on each new PDF that is put in the folder, and then move the newly OCR'd PDF to another folder? To me this seems to be the only way to speed up my workflow.
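
A polling loop is one simple way to get that behaviour. The sketch below is not an existing product: `run_ocr` is a placeholder you would have to supply yourself (a function that wraps whichever OCR tool you own and writes the searchable PDF to the given output path), and the folder names and timings are assumptions.

```python
import time
from pathlib import Path
from typing import Callable, List


def process_new_pdfs(inbox: Path, outbox: Path,
                     run_ocr: Callable[[Path, Path], None],
                     min_age_seconds: float = 10.0) -> List[str]:
    """OCR each PDF in inbox that has been left untouched for a while."""
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    now = time.time()
    for pdf in sorted(inbox.glob("*.pdf")):
        if now - pdf.stat().st_mtime < min_age_seconds:
            continue  # recently modified, the scanner may still be writing it
        run_ocr(pdf, outbox / pdf.name)  # placeholder: writes the OCR'd copy
        pdf.unlink()                     # drop the original only after success
        done.append(pdf.name)
    return done


def watch(inbox: Path, outbox: Path,
          run_ocr: Callable[[Path, Path], None],
          poll_seconds: float = 5.0) -> None:
    """Check the inbox every few seconds until interrupted (Ctrl+C)."""
    while True:
        process_new_pdfs(inbox, outbox, run_ocr)
        time.sleep(poll_seconds)
```

Pointing `outbox` at an Evernote import folder would complete the chain. Waiting for the file to go quiet before touching it is a guess at how to avoid grabbing a half-written scan; a stricter check may be needed on slow drives.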

I'm sure Fujitsu could make the ScanSnap software do this. But to add to my annoyances, this is another problem with Fujitsu. Although they make great scanners hardware-wise, their support and software are woeful. You are stuck with the same version of ScanSnap Organiser and the other bundled software for the life of the scanner, and because the software that drives the scanner is proprietary there is no option to use other scanning software. I guess they do this to entice you to purchase a new scanner. This really irks me. I have a ScanSnap S510 which I've had for several years now. I see no need to replace it until it breaks. But the ScanSnap software that comes with it is so dated, and there has been only a handful (fewer than I can count on one hand) of software updates, none of which provided any extra functionality. And don't get me started on how hard it is to get the Windows 7 64-bit drivers/software from Fujitsu....!

I wish Fujitsu would open up the specs for the scanners so someone out there could write some decent software for it (I wonder if it would be possible to reverse engineer the Fujitsu ScanSnap scanner drivers?).

So who still thinks Fujitsu ScanSnap scanners and Evernote work great together?

I do. If you've read the thread this far, you've gathered that folks do what serves them best; some scan direct to Evernote and let the elephant do the heavy lifting. Others scan to folder, mess with filenames and such, do the OCR, then move the files to an Import folder and let Evernote vacuum up the details.

There is various software to help you move files around and automate processes, like AHK, Belvedere, Breevy and others.

Somewhere within all that you will find the most effective process currently available for your particular needs.

I've tried photography, finding documentation on the web, flatbed scanners, copy typing and a variety of other methods of documenting stuff over the past 30+ years, and have recently converted a small reference library of documentation into 10,000+ notes in Evernote. Short of having a magic wand, I'd say the ScanSnap has been the most efficient conversion method during that process.


Another option is to let Adobe Acrobat do the OCR. The advantage of this is that it will both OCR and "optimize" the document at the same time (which also radically reduces the size of the file). If you have Adobe Acrobat 10.0 Pro, it even allows you to batch this process so that you don't have to do it individually every time (you can scan to a folder and periodically do a batch OCR on the entire folder).


I have Acrobat 8 Standard, which came as part of the scanner software. I can see it has the ability to OCR individual documents but not multiple documents at the same time. I've looked on the Adobe website for the features of Acrobat 10 Standard and Pro in this regard but can't find the specifics. Does Acrobat 10 (Standard or Pro) have an option to specify something like a watch folder, where newly scanned PDFs get OCR'ed and moved to another folder?


Hmmn. There's a slightly crossed conversation going on here - see this complementary thread. Can't speak for Acrobat 8 or 10, but my middle-of-the-road 9 (Standard) has an option under Document > OCR Text Recognition to 'Recognize Text in Multiple Files'. There's a dialogue box into which you drag and drop the folders or files you need processed. There's an option for where the processed files are saved, so you could set this to your import folder if you're comfortable with the filenames at that stage.

While on the subject of OCR, do make sure you scan any pictures to JPG rather than PDF - not meaning illustrations in a text, but photographs or artwork - because I do believe that Evernote's ability to see the text in pictures is better (or is at least more recent and tweaked more often) than Adobe's.


Hi,

First of all, Evernote is great and I really love it. Please continue like that :-)

Yesterday I bought the Fujitsu ScanSnap iX500.

My idea is to go paperless as much as possible.

Therefore my goal was to scan all my magazine articles into Evernote and have its awesome OCR do the work.

Unfortunately, yesterday I found out there is a 1 GB monthly traffic limit for premium users.

On average, the magazine articles I scanned in color at 150 dpi come to around 500 KB per page.

That would mean I can upload only around 2,000 pages per month, even though I'd like to scan many articles.
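
The estimate is easy to re-run with other page sizes. This is only an arithmetic sketch with a made-up function name; whether the 1 GB allowance is counted in decimal or binary units is an assumption either way, so both are shown.

```python
def pages_per_month(quota_kb: int, kb_per_page: int) -> int:
    """Whole pages that fit into a monthly upload allowance."""
    return quota_kb // kb_per_page


print(pages_per_month(1_000_000, 500))    # 1 GB as 1,000,000 KB -> 2000
print(pages_per_month(1024 * 1024, 500))  # 1 GB as 1,048,576 KB -> 2097
```

Either way the answer is roughly 2,000 pages, so the limit is mainly sensitive to the per-page size: halving it doubles the page count.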

 

 

My question is now:

Does anyone have the same issue, and is there a solution for it?

Has anybody scanned via the Fujitsu to a local drive and used OCR software like OmniPage, which I found via Google?

http://lifehacker.com/5624781/five-best-text-recognition-tools

Most important for me is to have searchable PDFs, either locally or in the cloud.

I would be very happy to hear your ideas.

 

Best,

Alex


I use an S1500 scanner to scan (at 300dpi) to a folder where I can rename, edit, join and generally mess around with the files, and then OCR them with the Adobe Acrobat 9 that came bundled with the scanner. My files aren't that large, and I've only gotten close to the monthly limit a couple of times. You can buy more capacity if you need to, or you could move your documents to a local (offline) folder until your limit resets. I'd suggest you have a go at scanning for a month and check your usage. Have a look at your OCR settings too; Adobe reduces the size of my scanned files by replacing pictures of text with actual characters, which take less space.


Another possibility is to hit "cancel" immediately when ScanSnap tries to OCR and to let Acrobat do it instead. One of the advantages of this method is that Acrobat X has PDF optimization and OCR on the same screen, i.e. you can select "OCR" and optimize the PDF at the same time. I have found that optimizing the PDF is a tremendous way to reduce file size. The last step that I do is to save the file as a "reduced file pdf". The combination of OCRing, optimizing and saving as a reduced file will very often reduce the file size by 75% or more. I just did this on a file that was originally 7MB and is now 1.6MB.

I'm confused about this, and I'm having problems with some scanned notes being OCR'd and some not. How do I "let" Acrobat do this? And where does ABBYY FineReader come in? Also, when you say you have OCR set in ScanSnap, are you referring to the screen where there is a check box for "Convert to Searchable PDF"?


See the comments here https://discussion.evernote.com/topic/62804-some-notes-not-ocrd/?p=287948 for more on this. OCR has been discussed a lot in the forums. And yes, "convert to searchable" means OCR. Don't tick that box if you want to scan faster and batch OCR at the end of the session.


Thanks for the suggestion, but by definition 'online OCR' means uploading your file to a website. There's some security risk with everything sent to the web, and while adding the same file to Evernote also means uploading it, doing so twice just doubles the risk. For some documents that may not matter, but I still prefer to do my OCR on my own hard drive.
