(Archived) The PDF file size mystery...

I posted somewhere here a while ago about a mysterious saving in file size when a PDF is OCR'd. I now have an answer, but can't find the original post (!) so FWIW here's the problem (again, sorry) and my current take on the answer.

I use a ScanSnap S1500 and scan to file so I can change filenames to my own system. My SSManager is also set to make the PDF searchable - i.e. to OCR the file on the fly. That slows down progress slightly, but when I'm 'batch' scanning, it also gives me just about enough time to look at the original so I get the saved name correct, and set up the next document for scanning. The file compression in SSManager is set to medium, which seems to give no problems and avoids mega-size files.

After an upgrade to SSManager some time ago the OCR process shortened considerably, and as a belt and braces test I decided to OCR a large saved file again to check that it had been properly processed the first time around. To my surprise the file size dropped by 10-15% after this second processing, which worried me even more - what had I broken?

(Here's a tip for free - if you're testing anything in this area, make sure you don't shred the original before you mess with the only electronic copy in your possession - or work with a copy of the file...)
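For anyone wanting to repeat the test safely, here is a minimal Python sketch of the "work with a copy" idea plus the size comparison. The function names and the ".orig" suffix are my own inventions for illustration, not anything from ScanSnap, ABBYY or Adobe, and the OCR step itself is not shown - you would run that by hand on the working file in between.

```python
import os
import shutil


def backup_copy(path, suffix=".orig"):
    """Keep an untouched copy of the file next to it (hypothetical naming scheme)."""
    backup = path + suffix
    # Never overwrite an existing backup: it may be the only pristine copy.
    if not os.path.exists(backup):
        shutil.copy2(path, backup)
    return backup


def size_reduction_percent(before_path, after_path):
    """Return how much smaller after_path is than before_path, in percent."""
    before = os.path.getsize(before_path)
    after = os.path.getsize(after_path)
    if before == 0:
        raise ValueError("original file is empty")
    # A negative result means the file grew.
    return 100.0 * (before - after) / before
```

Typical use would be to call backup_copy("scan.pdf") first, re-OCR scan.pdf, and then call size_reduction_percent("scan.pdf.orig", "scan.pdf") to see the saving ("scan.pdf" is just a placeholder name).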

Anyhow I've been lurking around Adobe fora trying to work out why the file size drops after a second OCR - meantime I've been doing that routinely on large files to save the extra overhead.

Come today I installed some new drivers on my system (I'm now up to 3 screens and loving it..) - and lost my ScanSnap. OK - reinstall that too. In the course of which I reminded myself that the OCR process is run via the wonderfully named ABBYY FineReader, and not Adobe.

Although ScanSnap gives you a copy of Acrobat 9.0 - maybe to partially justify the eye-watering price of the machine - it doesn't actually use Adobe to OCR the scan.

So when the searchable PDF file prepared by ABBYY goes through the Adobe mill, it gets ground a little smaller, presumably because Adobe's software is just that bit more compliant with the PDF standard.

So the moral of the story is - Adobe seems to be the 'best' OCR software around in terms of producing a smaller, searchable PDF file. And I haven't, as has been my nagging worry for weeks, been systematically messing up the files I've been saving, or storing up a massive re-OCR problem for myself. I'll stop double-OCR-ing large files unless it's the end of the usage month and I'm having capacity problems, and just sit back and enjoy the view from my desk.

Or I suppose I could scan to file and then batch OCR all those files before I upload.. decisions, decisions


"Or I suppose I could scan to file and then batch OCR all those files before I upload.. decisions, decisions"

I do this. I want my own titles, and I prefer to run OCR when I am not doing anything else. The machine hums away doing its work and doesn't affect my scanning speed.

