(Archived) Feature Request: Manual correction of OCR image data

C.Noize · June 8, 2010

Hi!

I recently noticed that, with an increasing number of images in my notes, I get more and more false hits on text searches. Even though the OCR is very, very good, it's not perfect, and it often associates the wrong words with text in an image. It would be nice if I could manually declare a recognition as false by right-clicking the yellow-marked text and choosing something like "selection does not equal search term".

The only thing you would have to do is delete the word from the image's OCR database and resync it. (I hope that's as easy as it sounds.)

ruudhein · June 9, 2010

Neat idea

jbenson2 · June 9, 2010

Yes, that would be nice.

I've seen this happen frequently with screen grabs of maps (with lots of small faint city names) for my long road trips.

What I do as a workaround is to assign a tag X to each image that has a false hit.

When I want more accurate results, I run a search which includes -tag:X

ruudhein · June 9, 2010

I was thinking about that before my reply but am unsure how that would work with multiple words.

If the image says "click here" and it comes up for the search "home", you'd add the X tag. But a refined search would then also exclude the image for the (correct) 'click'.

Maybe adding the misidentified word preceded by x could also help then?

xhere

Refined search:

here -xhere
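To make the difference between the two schemes concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not how Evernote's search works: the `Note` class, the `search` function and the sample note are all made up for illustration. A blanket X marker hides the image from every refined search, while a per-word marker (x plus the misread word) hides it only for that word.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """Hypothetical stand-in for a note with one OCR'd image."""
    title: str
    ocr_words: set                              # words the OCR believes are in the image
    markers: set = field(default_factory=set)   # tags or marker words added by hand


def search(notes, term, blanket_tag=None):
    """Return titles whose OCR words contain `term`, minus excluded notes.

    blanket_tag models a search like `term -tag:X`.
    The per-word check models a search like `term -xterm`.
    """
    term = term.lower()
    hits = []
    for note in notes:
        if term not in note.ocr_words:
            continue
        if blanket_tag is not None and blanket_tag in note.markers:
            continue  # blanket exclusion: drops the note for every term
        if "x" + term in note.markers:
            continue  # per-word exclusion: drops the note for this term only
        hits.append(note.title)
    return hits


# The image really says "click here", but the OCR also guessed "home".
blanket = [Note("button screenshot", {"click", "here", "home"}, {"X"})]
per_word = [Note("button screenshot", {"click", "here", "home"}, {"xhome"})]

print(search(blanket, "click", blanket_tag="X"))  # correct hit is lost
print(search(per_word, "home"))                   # false hit is suppressed
print(search(per_word, "click"))                  # correct hit survives
```

The cost of the per-word scheme is one marker per misread word, so it only pays off for images that keep turning up in searches.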

KTK_NJ · February 16, 2011

I just found this thread, and I agree that the ability to manually correct OCR would be very helpful. I recently searched for a word (can't remember which word) and wound up with a photograph of a knitting pattern in my results - somehow the OCR interpreted the knitting stitches as letters.

jefito · February 16, 2011

For images, the OCR data can be seen, and edited, by exporting to .enex format and looking for the OCR section of the image. Each recognized (erroneously or not) piece of text is an element with pixel 'x', 'y', 'w' (width) and 'h' (height) coordinate attributes; nested inside it are the candidate words, each represented by its own element. You could edit an exported .enex file and remove bad OCR guesses, or even, for extra credit, add your own items, then import it back in.
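A rough sketch of the "remove bad OCR guesses" step, in Python, for anyone who wants to try it. The element names are assumptions, since they are not spelled out above: the code assumes each recognized region is an `item` element and each candidate word is a `t` element, and both names are parameters so they can be changed to match what your export actually contains. It works on the OCR fragment only; pulling that fragment out of the .enex file and putting it back is not shown. The function name and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def remove_ocr_candidates(reco_xml, bad_words, item_tag="item", word_tag="t"):
    """Drop candidate words listed in bad_words from an OCR XML fragment.

    item_tag and word_tag are ASSUMED element names; check them against
    your own exported file before relying on this.
    """
    bad = {w.lower() for w in bad_words}
    root = ET.fromstring(reco_xml)
    # Snapshot the tree first so removing elements does not upset iteration.
    for parent in list(root.iter()):
        for item in list(parent.findall(item_tag)):
            for cand in list(item.findall(word_tag)):
                if (cand.text or "").strip().lower() in bad:
                    item.remove(cand)
            # A region with no candidates left cannot match anything; drop it.
            if item.find(word_tag) is None:
                parent.remove(item)
    return ET.tostring(root, encoding="unicode")


# Made-up fragment: one misread region and one correctly read region.
SAMPLE = (
    '<recoIndex>'
    '<item x="10" y="20" w="50" h="12"><t w="80">toner</t><t w="40">tower</t></item>'
    '<item x="10" y="40" w="60" h="12"><t w="90">click</t></item>'
    '</recoIndex>'
)

cleaned = remove_ocr_candidates(SAMPLE, {"toner", "tower"})
print(cleaned)
```

Work on a copy of the export and keep the original file until the re-imported note searches the way you expect.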

This would be a workaround with the emphasis on 'work'.

anghammarad · October 16, 2011

…This would be a workaround with the emphasis on 'work'.

Well put, Jeff - and I'm impressed at the observation! But yes, it's a prohibitive process. I for one look forward to having it in the GUI!

…Although I might just get in there once and fix something manually: you, machine, are pretty creative reading "toner" into that tablecloth pattern…

I'm seeing this as the sorta 'top result' because it's most recent in this sort 'by date created' view. At least this annoyance brought me here tonight, aye?
