
(Archived) Handwriting in PDFs on Windows 7



I've spent a fair bit of time researching PDF OCR in Evernote, and I came to the conclusion that it OCRs handwriting only in pictures, and only typed text in PDFs.

 

http://discussion.evernote.com/topic/23940-my-scanned-pdfs-dont-get-ocrd-by-en/?p=126567

 

I see what you are saying. I hope that is not true. Because that is really, really bad programming. Wow. Almost unbelievably bad. That's like having a calendar app that can post from your desktop to your phone, but not from a web client.

 

Anybody have any other suggestions/advice on how to get handwritten pdfs ocr'ed?

Link to comment
  • 2 weeks later...
  • Level 5*

Well.  I'd agree that Evernote seems only to OCR JPG and PNG files for handwriting - and that's not "bad programming" that's an additional feature.  Also there are additional rules about the size of files and the number of pages to OCR that may or may not have applied to your original notebook files. 

 

Even if the files were OCR'd, the rules applying to handwriting are different to preformatted fonts. "House" for instance can be horse, hearse, hands or mouse - there's a tree of possible interpretations, not a narrative text. Searches on content would be - interesting.

 

The most efficient way to be sure that content is OCR'd is to do it yourself.  Most PDF editors will do that for you.  If you save an OCR'd file,  Evernote will add it to the indexes immediately.

 

So - you could have a go at OCR-ing the files yourself if they haven't been done by now;  or you could export the pages of your PDF files to JPG (most PDF editors...) and re-save them into the current notes.

Link to comment

Well.  I'd agree that Evernote seems only to OCR JPG and PNG files for handwriting - and that's not "bad programming" that's an additional feature.  Also there are additional rules about the size of files and the number of pages to OCR that may or may not have applied to your original notebook files. 

 

 

--As a professional software engineer, I'd have to disagree. Especially since, at the time I purchased the premium service, I was purchasing it SO THAT it would OCR my PDFs. In other words, even though OCR is a cool, even amazing feature, it is not an additional feature if I buy your product because of it. For me, it was the central feature.

 

Even if the files were OCR'd, the rules applying to handwriting are different to preformatted fonts. "House" for instance can be horse, hearse, hands or mouse - there's a tree of possible interpretations, not a narrative text. Searches on content would be - interesting.

 

--I'm not nearly as interested in whether the OCR is perfect or not. I expect it to be good, because I'm paying for the product, but I don't expect anything like perfection (especially with my writing).

 

The most efficient way to be sure that content is OCR'd is to do it yourself.  Most PDF editors will do that for you.  If you save an OCR'd file,  Evernote will add it to the indexes immediately.

 

-- Again, I bought Evernote because I wanted my content OCR'd. I didn't buy Evernote so that I could buy another product to OCR my PDFs. And how is it more efficient to use two products to do one thing (especially when one of them is an advertised feature)?

 

So - you could have a go at OCR-ing the files yourself if they haven't been done by now;  or you could export the pages of your PDF files to JPG (most PDF editors...) and re-save them into the current notes.

 

--Again, I don't want individual JPGs. I want OCRd PDFs. For one, you can't do text select out of JPEGs, you can just search them which makes them useless for me in most of my use cases. For another, I can read a PDF on my kindle, computer, whatever easily. Photo viewers are not made for reading 5 page or 30 page documents. Now, what I mean by poor programming is that IF Evernote as a corporation has the software to OCR the handwriting in images, they also have the software to OCR PDFs. I can see what the technical difficulties would be, especially with editing PDFs, but they are not insurmountable or OpenOffice and Microsoft Word wouldn't be able to export to PDF (with OCR). Which they can. And OpenOffice is open source for heaven's sake.

I'm not saying that Evernote is a terrible company. And I'm happy that their website now reflects that only JPGs are OCRd. I'm happy with most things that they do. But this is a key, vital oversight for everyone trying to capture large volumes of information. For example, social workers: Just load all 300 pages of notes on this ONE child abuse case as individual IMAGES, and then TAG them with the specific case, and the source, and the people that wrote them, and then hope they never get out of order. That's like throwing everything into a shoebox. Then you can search the shoebox and type out any notes you find again by hand. Not a good solution. A PDF at least is like using a file folder. And if I'm going to buy OCR software, why even bother having a premium subscription that provides OCR in Evernote in the first place?

Link to comment

 

I see what you are saying. I hope that is not true. Because that is really, really bad programming. Wow. Almost unbelievably bad.

No, it's not bad programming.

http://discussion.evernote.com/topic/29830-ocr-confusion/?p=160751

 

As a software engineer (and as a consumer), I would say that not providing a feature you advertise is bad programming. Though to be fair, Evernote now makes clear that only JPGs are searchable. Please see below for a more in-depth response.

Link to comment

As a software engineer (and as a consumer), I would say that not providing a feature you advertise is bad programming. Though to be fair, Evernote now makes clear that only JPGs are searchable. Please see below for a more in-depth response.

"Not providing a feature you advertise" has absolutely nothing to do with programming. And please post where you think EN is advertising something they are not providing.

There is nothing "below".

And to clarify, images (not just JPGs) AND PDFs are searchable, but PDFs are a premium feature. The fact that images use a different indexing technology than the PDFs is a good thing b/c that is more appropriate b/c images are not text. They are just a bunch of pixels & simply cannot be indexed in the same way text can.

Link to comment
  • Level 5*

One of the problems with OCR of handwritten text is that handwriting varies tremendously from person to person. As a pharmacist I used to see some extremely bad handwriting from physicians/nurses that was hard for my eyeball to interpret what they actually meant. Because computers rely on clarity of input, the handwriting conundrum would be very difficult to resolve to everyone's satisfaction given that variation. That is not bad programming - just recognition of the challenges of true OCR with handwriting. Requiring users to always input text with a standard format for characters would also be bad programming.

 

I agree that if you have thousands of pages of text to scan it can be a problem. However, after years of working with PDFs, requiring other people to read what I have written and make changes to a master file, I have come to the conclusion that the only way to do this is with text boxes and typed information. That way, our graphics people can cut and paste exactly what I said, and everyone knows what I said. From a legal perspective, that might be challenging if I mis-speak, but for most situations it does work.

 

An alternative, if you have a single page, may be to write your notes within the note after you have attached the pdf. Again, it is perfectly clear what you said, and the text becomes fully searchable with Evernote's very powerful search tools. You can also use BitQwik to help refine the search capabilities even further.

Link to comment
  • Level 5*

My handwriting is so bad it's toxic,  so I'm always impressed when Evernote manages to get some of the content.  I always add extra typed information to handwritten material - explanatory titles,  tags and keywords at minimum;  I often do a summary of what's in the content "notes from a meeting with x, y and z about <something> in the lab June 2013.. agreed to:  etc" and when I'm trawling through notes and find an uncommented piece of handwritten text I'll add the details.

Link to comment

My handwriting is so bad it's toxic,  so I'm always impressed when Evernote manages to get some of the content.  I always add extra typed information to handwritten material - explanatory titles,  tags and keywords at minimum;  I often do a summary of what's in the content "notes from a meeting with x, y and z about <something> in the lab June 2013.. agreed to:  etc" and when I'm trawling through notes and find an uncommented piece of handwritten text I'll add the details.

 

 

I also add significant text to notes. To a fault. I find myself often adding "keywords" that are already text in the note. :P Simply b/c when creating a note, I try to anticipate what words I may use when looking this note up. It's second nature now. One "tip" I've used for years & posted about several times on the old board is I even add "misspellings" as keywords. E.g., if I create a note about the contact information for Joe Shafer, I will include a keyword of Shaffer b/c I'm sure I won't remember if Joe had one or two Fs in his name. Adding this "misspelling" allows me to find this note in one search. In a similar vein, if I often confuse words, I will add those in as keywords as well. E.g., I rarely need to set the date and/or time on our work computers. But every now & then I do. (Like once in about every five years.) When I need to do so, I tend to think of "setdate" b/c that's how the date is set in our legacy software & it's just something I remember. But if I need to set the Unix computer date, the command is 'date'. SO...in my note reminding myself how to set the Unix date, I have a keyword 'setdate'. This allows me to find the note when either I don't remember the Unix command OR I want to filter it from a bunch of false positives that may include the words 'date' and 'unix'.
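For anyone curious why the "misspelling" keyword trick works, here is a minimal sketch of a keyword index in Python. It is only an illustration: `NoteIndex`, `add` and `search` are made-up names, the note titles and text are invented from the examples above, and this is not a description of how Evernote's search is implemented.

```python
from collections import defaultdict


class NoteIndex:
    """Toy word index: maps each lowercase word to the note titles containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, title, text, keywords=()):
        # Index the words of the note body plus any extra keywords,
        # including deliberate misspellings or "wrong" command names.
        words = list(text.split()) + list(keywords)
        for word in words:
            cleaned = word.lower().strip(".,;:!?'\"()")
            if cleaned:
                self._index[cleaned].add(title)

    def search(self, *terms):
        # A note matches only if it contains every search term.
        matches = [self._index.get(term.lower(), set()) for term in terms]
        return set.intersection(*matches) if matches else set()


index = NoteIndex()
index.add("Joe Shafer contact", "Contact details for Joe Shafer", keywords=["Shaffer"])
index.add("Set unix date", "Use the date command to set the unix clock", keywords=["setdate"])

print(index.search("shaffer"))          # found via the misspelled keyword
print(index.search("setdate", "unix"))  # found via the legacy command name
```

The extra keywords cost nothing at search time: they are just more entries pointing at the same note, so either spelling lands on it in one search.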

Link to comment

 

As a software engineer (and as a consumer), I would say that not providing a feature you advertise is bad programming. Though to be fair, Evernote now makes clear that only JPGs are searchable. Please see below for a more in-depth response.

"Not providing a feature you advertise" has absolutely nothing to do with programming. And please post where you think EN is advertising something they are not providing.

There is nothing "below".

And to clarify, images (not just JPGs) AND PDFs are searchable, but PDFs are a premium feature. The fact that images use a different indexing technology than the PDFs is a good thing b/c that is more appropriate b/c images are not text. They are just a bunch of pixels & simply cannot be indexed in the same way text can.

 

By below, I meant above, I guess in this context, when I replied to gazumped's more detailed and thoughtful response.

I am going to go ahead and stand by my original statement that not OCR'ing handwriting in PDFs is a bad idea. Perhaps spending a bit of time wiki'ing PDF would be helpful for you to understand why "different indexing technology" is not a good thing.

Link to comment

One of the problems with OCR of handwritten text is that handwriting varies tremendously from person to person. As a pharmacist I used to see some extremely bad handwriting from physicians/nurses that was hard for my eyeball to interpret what they actually meant. Because computers rely on clarity of input, the handwriting conundrum would be very difficult to resolve to everyone's satisfaction given that variation. That is not bad programming - just recognition of the challenges of true OCR with handwriting. Requiring users to always input text with a standard format for characters would also be bad programming.

 

I agree that if you have thousands of pages of text to scan it can be a problem. However, after years of working with PDFs, requiring other people to read what I have written and make changes to a master file, I have come to the conclusion that the only way to do this is with text boxes and typed information. That way, our graphics people can cut and paste exactly what I said, and everyone knows what I said. From a legal perspective, that might be challenging if I mis-speak, but for most situations it does work.

 

An alternative, if you have a single page, may be to write your notes within the note after you have attached the pdf. Again, it is perfectly clear what you said, and the text becomes fully searchable with Evernote's very powerful search tools. You can also use BitQwik to help refine the search capabilities even further.

Thank you. Those are helpful ideas. But I'm not complaining about Evernote's OCR technology being bad or not recognizing my handwriting. I don't expect anything like perfect results from that. What I am speaking about is the fact that PDFs aren't OCRed at all.

In particular, as someone who attends professional conferences, I often have handwritten notes taken in or around important typed material.

Link to comment

My handwriting is so bad it's toxic,  so I'm always impressed when Evernote manages to get some of the content.  I always add extra typed information to handwritten material - explanatory titles,  tags and keywords at minimum;  I often do a summary of what's in the content "notes from a meeting with x, y and z about <something> in the lab June 2013.. agreed to:  etc" and when I'm trawling through notes and find an uncommented piece of handwritten text I'll add the details.

That's a helpful idea. Thanks for the suggestion. I will probably do that in addition ... but my hope is that my PDFs will be OCRed, and then (as OCR technology advances over the next 5 - 15 years) the OCR will gradually get better. :D

Link to comment
  • 4 months later...

 

As a software engineer (and as a consumer), I would say that not providing a feature you advertise is bad programming. Though to be fair, Evernote now makes clear that only JPGs are searchable. Please see below for a more in-depth response.

"Not providing a feature you advertise" has absolutely nothing to do with programming. And please post where you think EN is advertising something they are not providing.

There is nothing "below".

And to clarify, images (not just JPGs) AND PDFs are searchable, but PDFs are a premium feature. The fact that images use a different indexing technology than the PDFs is a good thing b/c that is more appropriate b/c images are not text. They are just a bunch of pixels & simply cannot be indexed in the same way text can.

 

 

Ok - I want to wade into this as a Mac (and iPad and iPhone) Premium user who wants and needs this exact same thing that the OP asked for.

 

This response from BurgersNFries appears to confuse the issue.

 

There are actually TWO kinds of PDFs: those that are text (and vector) based, and those that have images embedded.

 

BurgersNFries is referring to the first kind. For that kind of PDF, her response makes sense.

 

However, the kind that the OP and I both need to have indexed is the second kind: PDFs with images of handwritten documents embedded.

 

There is no reason that Handwriting Recognition (HR) on an image embedded in a PDF should be any more difficult or different than HR on a JPG or PNG. It seems like a totally arbitrary distinction.

 

But with one important difference for the end user. If I scan my many pages of handwritten notes as images, then they are just loose collections of image files with no structure. Then I must rely on Evernote to "keep them together" and if that fails for any reason, I'm totally screwed.

 

Whereas a PDF nicely bundles the image files together into one package. That's why I use it. That's why most of the scanning apps I use prefer to output PDFs. That's the same reason the OP was using it, and at least one other poster here.

 

So I'll re-ask the original question that the OP posted: why doesn't Evernote do HR on these PDFs? If I don't find a solution in Evernote, I will promptly get all my info out of it and find another solution (or go back to the desktop-based program that I was using).*

 

 

* And, no, the solution that another person posted here of exporting everything, converting to images or doing manual HR, then re-importing into Evernote is NOT AN OPTION. I'm using Evernote to save time, not to waste it.
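To make the "two kinds of PDF" distinction concrete, here is a rough Python sketch that guesses which kind a file is by looking for raw markers in its bytes. It is a heuristic under loud assumptions: it only works when the PDF's object dictionaries are stored uncompressed (newer files often pack them into compressed object streams, in which case it returns "unknown"), the function names `classify_pdf_bytes` and `classify_pdf_file` are invented for this example, and it says nothing about how Evernote itself decides what to index.

```python
import re

# Markers from the PDF format: an embedded picture is an image XObject
# (/Subtype /Image), and a page with real text refers to a /Font resource.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")
FONT_MARKER = re.compile(rb"/Font\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess the PDF kind from raw bytes: 'text', 'image-only', 'mixed' or 'unknown'."""
    has_image = IMAGE_MARKER.search(data) is not None
    has_font = FONT_MARKER.search(data) is not None
    if has_image and has_font:
        return "mixed"       # e.g. a scan that already carries an OCR text layer
    if has_image:
        return "image-only"  # the scanned-handwriting case discussed above
    if has_font:
        return "text"
    return "unknown"         # nothing visible, possibly compressed object streams


def classify_pdf_file(path: str) -> str:
    """Read a file from disk and classify it with the same heuristic."""
    with open(path, "rb") as handle:
        return classify_pdf_bytes(handle.read())
```

A file that comes back as "image-only" is the kind this thread is about: pages that are just pictures, so there is no text for a search index to pick up unless something runs recognition on the images first.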

Link to comment

 

 


 

Thank you. :D That was much more helpful than my sarcasm. I also really, really appreciate and agree with the last sentence, "I'm using Evernote to save time, not to waste it" !!!!!!! Once Evernote starts wasting my time, I'm done. Windows and Google (among others) are really good at search too.

Link to comment
  • Level 5*
So I'll re-ask the original question that the OP posted: why doesn't Evernote do HR on these PDFs? If I don't find a solution in Evernote, I will promptly get all my info out of it and find another solution (or go back to the desktop-based program that I was using).

 

I agree that it would be really convenient if Evernote could HR my handwritten notes, but sadly I've found that a substantial portion of humanity (including myself at times) has difficulty reading my scrawl. I find that my Mk 1 eyeball translations are usually pretty good however, hence my 'other' suggestion that good titles and copious keywords make notes easy to find - then I can eyeball the content to find what I need much more efficiently than trying to structure a search to jump to the exact page.

 

Not 'defending' Evernote as such,  but they've never claimed to offer an all-encompassing HR/ OCR service (don't think anyone does) so this is not a bug,  it's simply one (of many) features that they haven't gotten around to yet.  Based on performance to date I'd say Evernote is our best chance to get enhanced services like this - meantime there's workarounds.  I'm sure the discussion will have been noted.

Link to comment

Archived

This topic is now archived and is closed to further replies.
