radial9 Posted January 14, 2017
Did some searching here and did not find an answer to this question. My understanding is that the PDF OCR that EN applies to docs stored on their servers is proprietary. If in the future I download a doc from my notes back onto my PC, the EN OCR is stripped, making the PDF unsearchable.
Question: if I first run the doc through ScanSnap Organizer, which places OCR with the PDF doc, does that OCR “stay” with the doc through its journey to the EN servers AND back down in the future, or do the EN servers remove the ScanSnap ABBYY PDF OCR and replace it with theirs?
Thanks for any insight here.
gazumped (Level 5*) Posted January 14, 2017
Hi. In my experience it's generally better to OCR your own documents, for a number of reasons including that one. If you OCR a document (i.e. create a searchable PDF file), it will remain searchable whether or not it is attached to a note.
- You also know it has been done. The server-based OCR can take a little while, and by definition has to be done on the server, so your search capability is delayed by at least two syncs, maybe more if the OCR server is busy.
- There are some restrictions* on what can be OCR'd (file size, number of pages, searchable content); your document may not make the cut.
* The list:
- File size 50MB or less, 100 pages or less
- No “searchable” text (characters that can be selected and copied)
- No encryption
- Not a handwritten document
- Image OCR: max size 3000 x 2400px at 300 DPI
Hope that helps...
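
The restrictions listed in this post can be written down as a simple pre-flight check. This is only a sketch: the `PdfFacts` record and the `server_ocr_blockers` function are hypothetical names, the facts about the file have to be gathered by you or by a PDF tool of your choice, and treating "50MB" as 50 × 1024 × 1024 bytes is an assumption. It is not part of any Evernote API.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in the post above; the byte interpretation of "50MB" is an assumption.
MAX_BYTES = 50 * 1024 * 1024
MAX_PAGES = 100


@dataclass
class PdfFacts:
    """Hypothetical record of what you already know about one PDF."""
    size_bytes: int
    page_count: int
    has_selectable_text: bool
    encrypted: bool
    handwritten: bool


def server_ocr_blockers(pdf: PdfFacts) -> List[str]:
    """Return the listed restrictions this PDF appears to violate (empty list = none)."""
    reasons = []
    if pdf.size_bytes > MAX_BYTES:
        reasons.append("file larger than 50MB")
    if pdf.page_count > MAX_PAGES:
        reasons.append("more than 100 pages")
    if pdf.has_selectable_text:
        reasons.append("already contains searchable text")
    if pdf.encrypted:
        reasons.append("encrypted")
    if pdf.handwritten:
        reasons.append("handwritten document")
    return reasons
```

An empty result only means none of the listed restrictions applies; it does not guarantee that the server will actually index the file.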
DTLow (Level 5*) Posted January 14, 2017
13 hours ago, radial9 said:
> Did some searching here and did not find an answer to this question. My understanding is that the PDF OCR that EN applies to docs stored on their servers is proprietary. If in the future I download a doc from my notes back onto my PC, the EN OCR is stripped, making the PDF unsearchable. Question: if I first run the doc through ScanSnap Organizer, which places OCR with the PDF doc, does that OCR “stay” with the doc through its journey to the EN servers AND back down in the future, or do the EN servers remove the ScanSnap ABBYY PDF OCR and replace it with theirs?
Yes. If you OCR your documents, the OCR does not get removed when filing in Evernote.
The Evernote process is documented here: https://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/ and covers images, PDFs and handwriting. It is used for search purposes, and is not suitable for converting your documents to text.
I would turn on your scanner's OCR function. I don't have that option, and I'm not missing it; however, if/when I stop using Evernote, I'll have to run a mass OCR process.
persistentone Posted March 6, 2017
Okay, I am having problems already with the Evernote OCR. It is refusing to index some PDF files that contain reports as images, and I don't think there is any clear way to investigate why. When I select the note with the PDF I want indexed and go to the Windows app function Note | Word and Resource Counts, Evernote says I have one PDF document and zero PDF documents with Evernote OCR. I see no action to manually force the document through OCR. Very frustrating. Even if the feature does work, at the very least the error checking and feedback to the user are terrible.
I am ready to go with @gazumped's recommendation to use external OCR. What I really want is an application that will preserve the existing images in the PDF file, then create an index of the words it finds in those images and store that invisibly in the background. I have one PDF application, PDF Architect 5, that creates the OCR output as a new PDF file, and it mangles enough of the appearance that I won't accept that as the solution.
My documents are technical and have lots of long technical acronyms in them. I need absolutely first-rate OCR quality, but it is also a competitive market with lots of OCR tools, and I would prefer to stay below $100. What are some good options?
I am using Windows only, but if it matters my correspondents often use MacOS. So whatever OCR format is used, it should be something that would work with any ordinary Acrobat reader. And it goes without saying that the OCR data needs to be correctly readable by Evernote, because in the big picture I want the search terms in each PDF to get into the Evernote index of the notebook.
DTLow (Level 5*) Posted March 6, 2017
13 hours ago, persistentone said:
> Okay, I am having problems already with the Evernote OCR. It is refusing to index some PDF files that contain reports as images
I'm wondering about your description of "reports as images". Do you really have report image files inside a PDF? That's unusual. Can you post a sample of a report so we can test it ourselves?
The OCR/index process is documented here: http://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/
Here's an example of an inexpensive external OCR app: http://solutions.weblite.ca/pdfocrx/ It has a free trial edition you can try out.
persistentone Posted March 7, 2017
13 hours ago, DTLow said:
> I'm wondering about your description of "reports as images". Do you really have report image files inside a PDF? That's unusual. Can you post a sample of a report so we can test it ourselves? The OCR/index process is documented here: http://blog.evernote.com/tech/2013/07/18/how-evernotes-image-recognition-works/ Here's an example of an inexpensive external OCR app: http://solutions.weblite.ca/pdfocrx/ It has a free trial edition you can try out.
I am scanning paper documents from a Brother printer/scanner, and the Brother software creates a PDF file from those scans. There is no option in that software to use OCR when creating the PDF. In the PDFs created from the scanner, you cannot select text and you cannot search text. What else can this be but a scanned image of a printed page, inserted into the PDF file? What is the better description for me to use for this case?
How can I investigate the content of a PDF file to learn, for example, what image format is being used for each stored page, when text, font and page description information is not available?
I will try out the tool you are recommending. That looks like the kind of function I want. I wonder how accurate the OCR would be from a product produced by such a small company.
I am willing to create a test notebook and share it with you. Is there a specific Evernote name I should use when sharing with you?
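
On the question of how to investigate what is inside a scanned PDF: one rough approach is to look at the raw bytes of the file for image objects, font objects and the compression filter names that the PDF format uses (for instance `DCTDecode` for JPEG data). The sketch below does only that, using the Python standard library. The function name `sniff_pdf` is made up for illustration, and the method is a heuristic: newer PDFs can pack their object dictionaries into compressed object streams, in which case these markers are not visible in the raw bytes and a real PDF library is needed.

```python
import re

# PDF stream filter names and the image encoding each one usually indicates.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"CCITTFaxDecode": "CCITT fax",
    b"JBIG2Decode": "JBIG2",
}


def sniff_pdf(data: bytes) -> dict:
    """Count image objects, font objects and image filters visible in raw PDF bytes."""
    images = len(re.findall(rb"/Subtype\s*/Image\b", data))
    fonts = len(re.findall(rb"/Type\s*/Font\b", data))
    filters = {}
    for name, label in IMAGE_FILTERS.items():
        count = len(re.findall(rb"/" + name + rb"\b", data))
        if count:
            filters[label] = count
    return {"images": images, "fonts": fonts, "filters": filters}
```

To try it on a real file, read the bytes with `pathlib.Path("scan.pdf").read_bytes()` (the file name is a placeholder) and pass them in. Image objects with no font objects at all is consistent with an image-only scan that has no text layer, though it is not proof.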
DTLow (Level 5*) Posted March 7, 2017
1 hour ago, persistentone said:
> I am scanning paper documents from a Brother printer/scanner, and the Brother software creates a PDF file from those scans. There is no option in that software to use OCR when creating the PDF. In the PDFs that are created from the scanner, you cannot select text. You cannot search text. What else can this be but a scanned image of a printed page, inserted into the PDF file?
No problem. My scanner produces the same image PDFs. I was just concerned there were separate embedded image files.
No need to share a notebook. My process is to share an individual note by generating a public link and posting the URL, for example https://www.evernote.com/shard/s10/sh/5c8c7251-4228-4457-ba44-44b3da2e6792/83c560e1aa802b91ce5d0a470404c44c
You can also attach a file to your post in this forum (attached: Test PDF.pdf).
DTLow (Level 5*) Posted March 7, 2017
13 hours ago, DTLow said:
> Here's an example of an inexpensive external OCR app: http://solutions.weblite.ca/pdfocrx/ It has a free trial edition you can try out.
37 minutes ago, persistentone said:
> I will try out the tool you are recommending. That looks like the kind of function I want. I wonder how accurate the OCR would be from a product produced by such a small company.
I was satisfied with my initial review of this software (ymmv), but I didn't do extensive testing.
Currently, I'm happy with the EN OCR/search feature. Someday, I may have to address this and do a mass batch OCR. There's also the issue of handwriting conversion.
persistentone Posted March 7, 2017
5 hours ago, DTLow said:
> I was satisfied with my initial review of this software (ymmv), but I didn't do extensive testing. Currently, I'm happy with the EN OCR/search feature. Someday, I may have to address this and do a mass batch OCR. There's also the issue of handwriting conversion.
The weblite PDF OCR app on Windows is triggering Microsoft's SmartScreen anti-virus. SmartScreen refuses to let it run without an admin password, and naturally Microsoft doesn't give me any information at all about its reasoning. Any idea what is going on there?
gazumped (Level 5*) Posted March 7, 2017
I routinely batch OCR my PDFs. I scan to a folder on my hard drive, edit / rename files as necessary, then bulk OCR everything and drop the files into an Import Folder, which sucks them into Evernote. I prefer this process because I know the PDFs are scanned and searchable immediately they're uploaded, I can deal with any errors when they occur, and I can process files that fall outside Evernote's comfort zone for OCR, including ones bound for local notebooks.
My Fujitsu scanner box included both ABBYY and Adobe, with the scanner using ABBYY for its automatic OCR. My software of choice is Adobe, however: converting the graphics of scanned pages into characters gives a substantial file size saving, which can be important when using Basic accounts. Adobe seems to perform best for me, with around a 40% size reduction.
The choice of PDF software is pretty wide. If used manually, it's an open choice; see what's available to fit your budget: http://www.toptenreviews.com/business/software/best-ocr-software/
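
The scan, OCR, then drop-into-Import-Folder routine described in this post could be automated along the following lines. This is a sketch under assumptions, not gazumped's actual setup: `process_scans` is a hypothetical helper, and the `ocr` argument stands in for whatever OCR tool you use (it is expected to make the PDF searchable in place and to raise an exception on failure). Files that fail OCR are left in the scan folder so the error can be dealt with when it occurs.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_scans(scan_dir: Path, import_dir: Path,
                  ocr: Callable[[Path], None]) -> List[Path]:
    """OCR every PDF in scan_dir, then move the successes into import_dir."""
    import_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(scan_dir.glob("*.pdf")):
        try:
            ocr(pdf)  # hypothetical hook: your OCR tool of choice goes here
        except Exception as err:
            # Leave the file where it is so it can be fixed and retried.
            print(f"OCR failed, left in place: {pdf.name} ({err})")
            continue
        target = import_dir / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

Pointing `import_dir` at the folder that Evernote for Windows watches as an Import Folder would complete the loop; that folder path is something you configure yourself in the client.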
DTLow (Level 5*) Posted March 7, 2017
5 hours ago, persistentone said:
> The weblite PDF/OCR on Windows is triggering Microsoft's Smartscreen anti-virus. Smartscreen refuses to let it run without an admin password, and naturally Microsoft doesn't give me any information at all about what their reasoning for this was. Any idea what is going on there?
Sorry, no info. I'm on a Mac, and don't even use the software much. I wouldn't waste time on it; check out the alternatives in @gazumped's post.
Dave-in-Decatur (Level 5) Posted March 7, 2017
I use ABBYY FineReader, which does a great job of converting image-PDFs into text (savable as PDF or RTF) in multiple languages. It's the automatic indexing that @persistentone desires, though, that seems to me to be the sticking point. Is there software that does automatic indexing? Sounds very handy if there is!
gazumped (Level 5*) Posted March 7, 2017
20 minutes ago, Dave-in-Decatur said:
> Is there software that does automatic indexing?
Hmmn. I read that as just providing a search index of the content of a PDF. Another post somewhere recently asked about more complicated stuff like preparing an index of source material, and I believe there's a whole range of academic software that does a far more comprehensive analysis of metadata and text than 'just' OCR. I've used something called Docear if you want to take a look: http://www.sciplore.org/projects/docear/
ABBYY and Adobe, of course, add another layer to the PDF file after OCR which consists of the actual text content, so the searchable file can be downloaded and shared as necessary. The Evernote server-based OCR just keeps a local index in the note, which is not downloaded with the main document. It used to be possible to download that as well as, or instead of, the original PDF. Don't know if that's still possible.
DTLow (Level 5*) Posted March 7, 2017
3 hours ago, Dave-in-Decatur said:
> It's the automatic indexing, though, that @persistentone desires that seems to me to be the sticking point. Is there software that does automatic indexing? Sounds very handy if there is!
I like the indexing in Evernote, but if your PDFs are OCR'd I don't think the indexing software is a concern. I could leave the PDF file in my Mac OS and use the Finder search feature.
persistentone Posted March 9, 2017
As long as we are on the topic of doing OCR on a PDF using a tool external to Evernote: is there any reader that will let me examine the full text associated with the image so that I can edit out the mistakes?
gazumped (Level 5*) Posted March 9, 2017
12 hours ago, persistentone said:
> As long as we are on the topic of doing OCR on a PDF using a tool external to EverNote: is there any reader that will let me examine the full text associated with the image so that I can edit out the mistakes?
I used a version of ABBYY a while ago that would present an OCR'd document with highlights on any unrecognized words so the output could be edited where necessary. Don't know if that feature still exists. I think you'd have to check out the top few apps on your prospective purchase list to confirm what degree of 'training' (for the recognition software) is available. It should be possible to add generic medical or scientific terms as additional libraries too.
persistentone Posted March 10, 2017
10 hours ago, gazumped said:
> I used a version of ABBYY a while ago that would present an OCR'd document with highlights on any unrecognized words so the output could be edited where necessary. Don't know if that feature still exists. I think you'd have to check out the top few apps on your prospective purchase list to confirm what degree of 'training' (for the recognition software) is available. It should be possible to add generic medical or scientific terms as additional libraries too.
That's an interesting feature, but it is different from what I am suggesting. I want to edit the entire full-text document, so I can correct mistakes the software made on terms it thinks it got right but actually got wrong.
I understand your point that I need to look around. I am hoping someone has already identified the software and will respond on this thread some day in the future.
Dave-in-Decatur (Level 5) Posted March 10, 2017
10 hours ago, persistentone said:
> That's an interesting feature, but it is different from what I am suggesting. I want to edit the entire full-text document, so I can correct mistakes the software made on terms it thinks it got right but actually got wrong. I understand your point that I need to look around. I am hoping someone has already identified the software and will respond on this thread some day in the future.
I think ABBYY FineReader would do what you want. You can learn about its features and download a trial copy at https://www.abbyy.com/en-us/finereader/.
DTLow (Level 5*) Posted March 11, 2017
On 2017-03-09 at 7:55 PM, persistentone said:
> I want to edit the entire full-text document, so I can correct mistakes the software made on terms it thinks it got right but actually got wrong.
Check out http://help.abbyy.com/FineReader/FineReader12/English/CheckResults/StepChecking.htm
From that page: "Once the OCR process is complete, the recognized text appears in the Text window. The characters recognized with low confidence will be highlighted, so that you can easily spot the OCR errors and correct them. You can edit recognized texts either directly in the Text window or in the Verification dialog box (click Tools > Verification… to open the dialog box). In the Verification dialog box, you can review low-confidence words, correct spelling errors, and add new words to the user dictionary."
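
The idea behind a low-confidence review with a user dictionary can be illustrated in a few lines. This is a generic sketch and not FineReader's API: `words_to_review`, the 0.80 threshold and the (word, confidence) pairs are all made up for illustration, on the assumption that an OCR tool can report a confidence score between 0 and 1 for each recognized word.

```python
from typing import FrozenSet, List, Tuple


def words_to_review(words: List[Tuple[str, float]],
                    threshold: float = 0.80,
                    dictionary: FrozenSet[str] = frozenset()) -> List[str]:
    """Return words recognized below the threshold that are not in the user dictionary."""
    return [
        word
        for word, confidence in words
        if confidence < threshold and word.lower() not in dictionary
    ]


# Hypothetical OCR output for a technical report.
sample = [("HPLC", 0.55), ("the", 0.99), ("rnethod", 0.62), ("GC-MS", 0.70)]
flagged = words_to_review(sample, dictionary=frozenset({"hplc", "gc-ms"}))
```

Adding known acronyms to the dictionary keeps them out of the review list. Note the limit of this approach: a word that was recognized wrongly but with high confidence is never flagged, which is the case persistentone describes, so a full read-through of the text layer is still needed for that.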
persistentone Posted March 12, 2017
19 hours ago, DTLow said:
> Check out http://help.abbyy.com/FineReader/FineReader12/English/CheckResults/StepChecking.htm Once the OCR process is complete, the recognized text appears in the Text window. The characters recognized with low confidence will be highlighted, so that you can easily spot the OCR errors and correct them. You can edit recognized texts either directly in the Text window or in the Verification dialog box (click Tools > Verification… to open the dialog box). In the Verification dialog box, you can review low-confidence words, correct spelling errors, and add new words to the user dictionary.
That looks great. It costs about 300% of what a Premium account on Evernote costs, so it is not cheap. But I have bookmarked it, and I may have no choice.
Archived
This topic is now archived and is closed to further replies.