(Archived) ScanSnap PDF's: searchable or non-searchable?


  • Level 5

I ran a test with Evernote to check out the differences between the ScanSnap searchable and non-searchable PDF's. I am a premium user of Evernote, so my PDF's are automatically converted to searchable by EN.

I scanned a magazine page with mostly text and a small color photo.

The ScanSnap Settings were:

* Send to ScanSnap Organizer

* Normal Image Quality

* Auto Color Detection

* Single Sided Scan

The non-searchable PDF (size 303 KB) took 11 seconds to scan

The searchable PDF (size 342 KB) took 43 seconds to scan and convert

Added to Evernote at 7:25 PM

Manually synchronized at 7:27 PM: could only find the search word in the searchable PDF

Manually synchronized again at 7:30 PM: found the search word in both PDF's

If I have several pages to scan, the non-searchable scan looks like the way to go.

11 seconds vs 43 seconds

Link to comment

Hello

Can you check the size of each PDF? I am using another scanner, so I can't speak for yours, but for me the searchable version is about 8 times smaller. So I made the opposite choice: I use the searchable version, which takes less time to upload and less time to display in EN.

Link to comment
  • Level 5

Your searchable version is 8 times smaller than the non-searchable? That is amazing.

I ran a few other tests with my Fujitsu ScanSnap. The searchable scan/convert time is usually about twice that of the non-searchable version. The size difference is negligible.

A Form with some handwriting

no search: 11 seconds - 34 KB

search: 24 seconds - 51 KB

A Business Letter - all text with a small color logo

no search: 11 seconds - 135 KB

search: 19 seconds - 155 KB

A Small Photo

no search: 9 seconds - 44 KB

search: 20 seconds - 44 KB

With Evernote's premium feature of automatically converting my PDF's, I think I will stay with the non-searchable scan option.
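For anyone who wants to line these numbers up side by side, here is a quick Python sketch. The figures are copied from the tests reported in this thread; the `TESTS` list and the `compare` function are just illustrative names, not part of any ScanSnap or Evernote tool.

```python
# Figures copied from the tests reported in this thread:
# (document, seconds without OCR, seconds with OCR, KB without OCR, KB with OCR)
TESTS = [
    ("Magazine page", 11, 43, 303, 342),
    ("Form with handwriting", 11, 24, 34, 51),
    ("Business letter", 11, 19, 135, 155),
    ("Small photo", 9, 20, 44, 44),
]


def compare(tests):
    """Return per-document time ratio and size overhead of the searchable scan."""
    rows = []
    for name, t_plain, t_ocr, kb_plain, kb_ocr in tests:
        rows.append({
            "name": name,
            "time_ratio": t_ocr / t_plain,       # how many times longer OCR takes
            "extra_seconds": t_ocr - t_plain,    # added wait per document
            "extra_kb": kb_ocr - kb_plain,       # added file size per document
        })
    return rows


if __name__ == "__main__":
    for row in compare(TESTS):
        print(f"{row['name']:<22} {row['time_ratio']:.1f}x the time, "
              f"+{row['extra_seconds']} s, +{row['extra_kb']} KB")
```

These are single measurements on one machine, so treat the ratios as rough indications rather than benchmarks.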

Link to comment

If you make a Searchable PDF on your own computer, you are running the OCR image processing on your own system. This is where the extra time goes.

If you add a non-Searchable PDF to Evernote, we'll basically do the same work, but on our servers. So you won't "pay" for the local CPU, we'll take care of it.

The only drawback I can see to your approach is disk space ... Evernote stores both your original PDF and the searchable one that we produce. Since both of these go into your local EN database, you'll use up more hard drive space if you have us do the OCR rather than just processing it yourself before uploading.

Link to comment
  • Level 5
"The only drawback I can see to your approach is disk space ... Evernote stores both your original PDF and the searchable one that we produce."

Whoa! That puts an entirely different spin on my analysis.

1.) Is there a way I can search for PDF's that are duplicated?

2.) If I do the OCR conversion locally, am I correct that there will only be one PDF stored?

Thanks

Link to comment

Every scanned PDF will create an alternate OCR version in the service if you are Premium, and if the PDF does not already contain OCR text.

So if you do the OCR first, your database will only contain your original copy. We won't OCR this PDF again.

(Note that we don't "charge" your monthly upload allowance for the searchable PDF we create ... this is just an internal detail to permit you to search your own PDFs. The storage cost I mentioned is just on your own personal hard drive.)

Link to comment

As for potential drawbacks of not doing local OCR, should we ever want to get our files out of evernote one day (unlikely, but one must at least consider this possibility), is it the case that your OCR version would not be exported with the original pdf? So local OCR would at least guarantee that OCR would always be available in future.

I was kinda hoping that you would still OCR pdfs that are already OCR'ed. Reason being that your OCR might be better than scansnap's, or at least provides a different take on the same data using a different algorithm.

Mike

Link to comment

You can always get both versions from either the Mac or Windows client via right-click ... we always keep your original PDF, and we may have an alternate OCR version if we processed it on the service.

Link to comment
  • Level 5
Aug 23 - So if you do the OCR first, your database will only contain your original copy. We won't OCR this PDF again.
Aug 24 - You can always get both versions from either the Mac or Windows client via right-click ... we always keep your original PDF, and we may have an alternate OCR version if we processed it on the service.

For Premium users who do the OCR locally, the Aug 23 comment appears to conflict with the Aug 24 comment.

I'm a bit confused now. How do I (as a Premium user) avoid doubling all my PDF's in my local database?

Link to comment

If you perform OCR on your own computer to make a PDF that already has OCR, and then upload this to Evernote, we will see that your PDF already contains text. We won't do our own OCR, and therefore there will only be one PDF document.

If you don't perform OCR, and you send us a scan that doesn't contain any searchable text, we will perform OCR on our servers. This will create a second "alternate" PDF document which we will copy to your client. You will still have the original PDF in your database, plus a second searchable one that we created.

If you are scanning documents, you can choose to do the OCR yourself to save space on your computer, or let us do the OCR to save time/CPU on your computer.
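The rule described above can be sketched as a small Python function. This is only a model of the behavior as explained in this thread, not Evernote's actual code; the function name and its parameters are hypothetical.

```python
from typing import List


def pdfs_stored_locally(has_text_layer: bool, is_premium: bool) -> List[str]:
    """Model which copies of a scanned PDF end up in the local database.

    Assumption (from this thread only): the service creates an alternate
    searchable copy when the account is Premium and the uploaded PDF does
    not already contain OCR text. The original is always kept.
    """
    copies = ["original"]
    if is_premium and not has_text_layer:
        copies.append("alternate searchable copy made by the service")
    return copies


if __name__ == "__main__":
    # OCR done locally before upload: only the original is stored.
    print(pdfs_stored_locally(has_text_layer=True, is_premium=True))
    # No local OCR on a Premium account: original plus the alternate copy.
    print(pdfs_stored_locally(has_text_layer=False, is_premium=True))
```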

Link to comment

We're basically trying to avoid consuming extra bandwidth and storage on your computer just to add some potential false positives. Desktop OCR software does a pretty good job these days with clean scanned documents, and we figure you can always turn that off before you scan to Evernote if you'd prefer to use ours instead.

Link to comment

Question for existing scansnappers: when in OCR mode and scanning into evernote, can the scanning software multitask, in the sense of letting you start a new scan while the previous scan is still being OCR'ed, or do you have to wait for OCR to complete before starting the next scan? In other words, is the OCR time truly a timewaster in practice, particularly when scanning a whole bunch of documents in quick succession?

Also, presumably the scansnap S1500 (which I understand replaces the S510) has the capability to scan direct into evernote? I would assume so.

Thanks,

Mike

Link to comment
  • Level 5

I've only had my ScanSnap for a few days, so I don't claim to be an expert.

Personally, I don't like to scan multiple documents at one time. But you can do that if you wish.

1.) With the ScanSnap Manager, set the Application to ScanSnap Organizer, then check "Continue scanning after current scan is finished" (under Scanning), and turn off the SEARCHABLE PDF setting (under File option).

2.) After the documents are scanned, go into the ScanSnap Organizer and click on the Convert into SEARCHABLE PDF option (under the PDF file tab)

I scan individually so that I can review the document in order to accurately name the PDF file.

Link to comment

OK so, for a stack of documents being scanned in a single sitting, to avoid sitting there watching the software perform OCR time and time again, one would scan first without OCR, then batch OCR, then drop the whole lot into evernote? There is no way of scanning direct into evernote (as I understand you can) for a bunch of documents in turn, without enduring the extra wait for each scan to perform OCR? Unless of course you let evernote do the OCR.
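To put a rough number on the waiting, here is a small Python sketch. It assumes the per-document times from the first test in this thread (11 seconds without OCR, 43 seconds with OCR) stay constant for every document, and that a later batch OCR pass or Evernote's server-side OCR needs no attention; neither assumption has been verified. The function name is made up for illustration.

```python
SCAN_ONLY_SECONDS = 11      # non-searchable scan, from the first test above
SCAN_WITH_OCR_SECONDS = 43  # searchable scan and convert, from the first test above


def attended_seconds(documents: int, seconds_per_document: float) -> float:
    """Time spent at the scanner, assuming a constant time per document."""
    return documents * seconds_per_document


if __name__ == "__main__":
    stack = 20  # hypothetical stack size
    plain = attended_seconds(stack, SCAN_ONLY_SECONDS)
    with_ocr = attended_seconds(stack, SCAN_WITH_OCR_SECONDS)
    print(f"{stack} documents, OCR deferred: {plain / 60:.1f} min at the scanner")
    print(f"{stack} documents, OCR per scan: {with_ocr / 60:.1f} min at the scanner")
```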

Mike

Link to comment
  • 3 years later...

Archived

This topic is now archived and is closed to further replies.
