(Archived) PDF/image OCR without umlauts

nafets · August 5, 2009

Hi

I searched the FAQ and the forum, but could not find any information about this problem: the PDF and image OCR does not recognize umlauts ("German characters") like äöü. Your handwriting recognition already does. As a result, your OCR for PDFs and images is not very useful for me. Let me give you an example: "Kuchen" means cake, but "Küchen" means kitchens. Quite a difference, don't you think? ;-)

So, my question is: when will you support different OCR languages or at least recognize umlauts?

Regards

engberg · August 5, 2009

To produce decent results (without a lot of "false positives"), our text recognition always targets a specific primary language for each document. That language determines the set of characters that are legitimate, and also the "dictionary" of words that we use to favor real words over non-word sequences of characters.

Currently, we've only launched Evernote in English and in Russian. If your language preference is set to Russian, we'll recognize words in Russian and English in your documents. Otherwise, we'll recognize words in English, which means we'll only look for Latin characters, Roman numerals, etc. We won't try to interpret dots above the 'u' character as possible umlauts, we won't try to match the Eszett character (ß), etc.
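
As a rough illustration of the two mechanisms described above (a per-language set of legal characters plus a word dictionary), here is a minimal Python sketch. It is not Evernote's actual implementation: the `PROFILES` table, the `recognize` function, the per-glyph confidence scores, and the tiny word lists are all hypothetical and exist only to show the idea.

```python
from itertools import product
import string

ASCII_LETTERS = set(string.ascii_letters)

# Hypothetical language profiles: legal characters plus a toy dictionary.
PROFILES = {
    "en": {"alphabet": ASCII_LETTERS, "words": {"cake", "kitchen"}},
    "de": {"alphabet": ASCII_LETTERS | set("äöüÄÖÜß"), "words": {"kuchen", "küchen"}},
}


def recognize(glyphs, language):
    """Pick the most plausible string for a sequence of glyph candidates.

    `glyphs` is a list with one entry per character position; each entry is
    a list of (character, confidence) guesses (assumed input format).
    """
    profile = PROFILES[language]
    # Step 1: drop guesses that are not legal characters in this language.
    filtered = [
        [(ch, p) for ch, p in candidates if ch in profile["alphabet"]]
        for candidates in glyphs
    ]
    if any(not candidates for candidates in filtered):
        return None  # some position has no legal character at all

    best_word = None  # best-scoring string that is a dictionary word
    best_any = None   # best-scoring string overall (fallback)
    for combo in product(*filtered):
        text = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p
        if best_any is None or score > best_any[1]:
            best_any = (text, score)
        # Step 2: prefer strings that appear in the dictionary.
        if text.lower() in profile["words"]:
            if best_word is None or score > best_word[1]:
                best_word = (text, score)
    return (best_word or best_any)[0]


# Glyph guesses for a scan of "Küchen" with a slightly ambiguous third letter.
scan = [
    [("K", 0.9)],
    [("ü", 0.6), ("u", 0.4)],
    [("e", 0.55), ("c", 0.45)],
    [("h", 0.9)],
    [("e", 0.9)],
    [("n", 0.9)],
]

print(recognize(scan, "en"))  # Kuehen
print(recognize(scan, "de"))  # Küchen
```

With the English profile the "ü" guess is discarded before any word matching happens, so the umlaut can never appear in the result. With the German profile the umlaut is a legal character, and the dictionary also corrects the ambiguous third letter.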

We plan to launch Evernote in several other languages this year (http://blog.evernote.com/2009/07/29/eve ... n-program/), which will include both localization of the clients and tuning of the text recognition for these languages. Once we've launched German support, you'll be able to specify German as your primary language and get good recognition for German characters and words.
