Problem with Search, OCR and Legacy vs V10


I wonder if anyone has any suggestions on a problem/bug I have with V10. I use legacy mainly, but regularly try V10 to see how it's progressing.

I've found a number of bugs which would prevent me using V10, but one in particular confuses me - and I wonder if anyone has any suggestions (BTW I have already raised a related support ticket, but no resolution yet).

I noticed a discrepancy in the number of search results between Legacy and V10. I finally narrowed it down to some search terms not being found in V10, when they were found in legacy. These related to finding text in images. For a given note, legacy indexed more terms in an image than V10, so if I searched for one of those terms, V10 would not return it. 

Firstly, I couldn't understand this behaviour, as surely legacy and V10 "see" the same OCR'd terms and should be equally searchable. The image in question was a newspaper cutting with a caption, as part of a much longer note. So, I copied just this one image to a new note. Within a few minutes, both legacy and V10 could find this new note, using the same search term that had failed in V10 before. I then tried copying the full note, but as before, legacy could find the copy and V10 couldn't.

So, it looks like V10 is not "seeing" all the OCRd/indexed terms that legacy does, unless you create a completely new note (copying a note isn't enough).

Is there any way to "force" V10 to re-index the OCR'd terms which seem correct in Legacy, as this could fix this problem?

 

Link to comment
34 minutes ago, FrankC said:

So, it looks like V10 is not "seeing" all the OCRd/indexed terms that legacy does, unless you create a completely new note (copying a note isn't enough).

This is interesting. It looks like EN has made a number of changes to the indexing of images. The first is that the note information window no longer records whether the images have been indexed. The second is that EN seems to have changed the way it records the indexing data. If you export a note from the legacy version and open the enex file with a text editor, you will see the words that have been found at the bottom. There used to be an online tool that would do this from the weblink, but that is no longer available.

If, on the other hand, you export from V10, you get a smaller file which does not contain the indexing information. The data must be somewhere (because V10 found the note and highlighted the search term in my image), but it could be on the server or in the local SQL database. It's certainly not being exported by V10.
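Rather than reading the enex file by eye, the indexed words can be pulled out with a short script. This is a minimal sketch, assuming the legacy export stores the OCR data in a `recognition` element per resource, holding a `recoIndex` XML document whose `t` elements are the candidate words. The function name `reco_terms` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def reco_terms(enex_text: str) -> dict[str, set[str]]:
    """Map each note title to the lower-cased OCR candidate words in its resources."""
    terms: dict[str, set[str]] = {}
    root = ET.fromstring(enex_text.strip().encode("utf-8"))
    for note in root.iter("note"):
        title = note.findtext("title", default="(untitled)")
        found = terms.setdefault(title, set())
        for reco in note.iter("recognition"):
            payload = (reco.text or "").strip()
            if not payload:
                continue
            # The payload is itself an XML document (recoIndex) wrapped in CDATA.
            index = ET.fromstring(payload.encode("utf-8"))
            for candidate in index.iter("t"):
                if candidate.text:
                    found.add(candidate.text.lower())
    return terms


# Hypothetical, heavily shortened export used only to exercise the function.
LEGACY_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<en-export>
  <note>
    <title>Newspaper cutting</title>
    <resource>
      <mime>image/png</mime>
      <recognition><![CDATA[<recoIndex objType="image"><item x="1" y="2" w="30" h="10"><t w="87">Caption</t><t w="40">Capton</t></item></recoIndex>]]></recognition>
    </resource>
  </note>
</en-export>"""

print(reco_terms(LEGACY_SAMPLE))
```

A note exported without any `recognition` element simply maps to an empty set, which is what you would expect to see for a V10 export if the indexing data really is missing from the file.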

My conclusion is that if there has been a change between the two systems then it is possible that some indexing has been lost in the process. I don't know a way of reindexing images. I would check on the web version because that will identify if the problem is simply synchronisation. I think it would be worth raising a support ticket because this is clearly not supposed to happen and so far I've not identified the problem with any of my notes.

Link to comment
4 minutes ago, Mike P said:

My conclusion is that if there has been a change between the two systems then it is possible that some indexing has been lost in the process. I don't know a way of reindexing images. I would check on the web version because that will identify if the problem is simply synchronisation. I think it would be worth raising a support ticket because this is clearly not supposed to happen and so far I've not identified the problem with any of my notes.

Yes - there has to have been some difference - but what, and how to work around this?

In reply to some of your specific points:

I don't think it's a sync issue as (a) it's a very old note and (b) it works if I cut/paste just the image in question to a completely new note

I have already raised a support ticket, but no meaningful response yet.

You may not have identified this problem because you haven't noticed it yet. Try a few searches in legacy and V10 which include text in images. If you get different numbers of notes returned in the searches, that's probably this issue. If you just searched in one version - you may not have noticed the missing notes in the search results.
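If you copy the note titles from each result list, the comparison can be done mechanically. A minimal sketch, assuming you have the titles returned by the same search in each version; the titles and the function name `missing_from` are invented for the example, and notes sharing a title would be collapsed into one entry.

```python
def missing_from(reference: list[str], candidate: list[str]) -> list[str]:
    """Return titles present in `reference` but absent from `candidate`, sorted."""
    return sorted(set(reference) - set(candidate))


# Hypothetical search results for the same query in each version.
legacy_hits = ["Holiday 2012", "Newspaper cutting", "Receipts"]
v10_hits = ["Holiday 2012", "Receipts"]

print(missing_from(legacy_hits, v10_hits))  # ['Newspaper cutting']
```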

 

Link to comment
19 minutes ago, FrankC said:

If you just searched in one version - you may not have noticed the missing notes in the search results.

I have searched in both V10 and legacy and also downloaded the enex file from both versions. I've also looked for both old and new notes. I'm not saying I don't have the problem but I certainly haven't found it yet. 

22 minutes ago, FrankC said:

I don't think it's a sync issue as (a) it's a very old note and (b) it works if I cut/paste just  the image in question to a completely new note

Not sure why you don't want to check in the web version, which will only take a few minutes. The age of the note is irrelevant if it is just refusing to sync or has synced incorrectly. It works when you cut and paste the image because it reindexes it as a new note.

Link to comment
14 minutes ago, Mike P said:

I have searched in both V10 and legacy and also downloaded the enex file from both versions. I've also looked for both old and new notes. I'm not saying I don't have the problem but I certainly haven't found it yet. 

Not sure why you don't want to check in the web version, which will only take a few minutes. The age of the note is irrelevant if it is just refusing to sync or has synced incorrectly. It works when you cut and paste the image because it reindexes it as a new note.

Good to see that it's not happening for you. It's a confusing issue as it only affects some of the text in the image - other parts of the text are indexed - in all versions. It's possible many people will never get (or notice) the problem. My worry is that you wouldn't easily know there were missing notes in the search results, unless you try the same search in the old and new versions.

You're right of course - testing should be on all versions - even if you're "sure" it won't make a difference. Should have remembered that from my coding days.
So, I tried it on the web version but got the same result - same missing notes.
I was pretty sure the note had synced, because I could find it with other searches (not using the specific text-in-image which caused the problem).

I have already confirmed that creating a new note (with cut'n'paste) works, and also that just copying the note doesn't. So, the problem seems to be something missing in the data stored about the note in V10 (the indexing data you refer to in your first reply). This has to be some sort of deficiency in V10 when it migrates/opens the database from the legacy version.


 

 

Link to comment

Any problem is annoying when it isn't consistent and appears random. I'll keep watching but, fingers crossed, I seem to be OK. I am having problems searching for PDFs (not indexed ones), which I will report elsewhere.

Link to comment

Oops, aren't PDFs and images indexed on the server side (only)? If so (as is/was my understanding), using EN-10 should not show differences. If it does (and it does when I go looking for it 😉), it might be a sync issue...

I tried the following:

  • I've an image, that contains "offizielle Abgabefrist" in two notes:
    • both (Legacy and EN-10 on Windows) find the notes
  • I've copied this image to another note in EN-10:
    • EN-10 does not find the newly extended note
  • After a sync in Legacy (even nearly immediately after the copy operation in EN-10)
    • Legacy finds the newly extended note
  • (A) Even after 10 minutes (and some other change in other notes that have been synced correctly to Legacy)
    • EN-10 does not find the newly extended note
  • (B) tested in EN-Web
    • ... finds the note
  • (C) re-tested in EN-10 
    • ... finds the note

Regarding (A): this is a real problem for me because there is no indicator that tells me "Sync is done completely", which is a problem not only when searching in images and PDFs...

Regarding (B): I cannot believe that EN-Web also maintains a local index, so this is a clear sign that indexing is done on the server side and all works well there (and in EN-Web, which relies on server availability).

Regarding (C): this is another clear sign that syncing is a gamble in EN-10.

Link to comment

Yes - confirmation of syncing is an issue, but I don't think it's causing the problem I reported.

My interpretation of my problem is that something is failing (or partially failing) when moving the recoIndex attributes (see the blog post linked below) from the legacy database to the V10 database. This would explain why searching for text partially works on old notes, and fully works on new notes.

This blog post explains the old (legacy) process, but there doesn't seem to be a V10 version with current information:

https://evernote.com/blog/how-evernotes-image-recognition-works/

I've confirmed that the legacy system enex file for the note in question contains the relevant  search phrase. The V10 note doesn't contain recoIndex entries - so I can't check them.

 

 

Link to comment

Thanks for the link - I hadn't seen it so far...
... and it confirms that indexing is done on the server side. Because there is no direct link between the EN-10 and Legacy databases on your local system, all of this is a syncing problem:

  • Every change to notes and embedded images has to be transferred to the servers
  • servers do their OCR job (at some point, depending on your account and their load)
  • after this job is done (and recoIndex is added to the note),
    • findings have to be merged into the search index (on the server side)
    • and the note has to be distributed to clients
  • within the clients (except Web clients)
    • a (back-)synced note has to be merged into the local index tables
    • before newly found text can be found by search operations
  • within Web clients
    • the new text should be recognized immediately because all search operations are executed as server requests

Unless I'm overlooking something, all of these effects come down to sync lag in the EN-10 clients.
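The steps above can be sketched as a small state model. This only illustrates the hypothesis in this post, not how Evernote's servers actually work; the stage names and the `searchable` function are invented for the example.

```python
from enum import IntEnum


class Stage(IntEnum):
    UPLOADED = 1          # change transferred to the servers
    OCR_DONE = 2          # recoIndex computed on the server
    SERVER_INDEXED = 3    # findings merged into the server-side search index
    SYNCED_TO_CLIENT = 4  # note distributed back to a desktop client
    LOCAL_INDEXED = 5     # client merged the note into its local index tables


def searchable(stage: Stage, client: str) -> bool:
    """Whether OCR text is findable in `client`, under the assumed pipeline."""
    if client == "web":
        # Web searches are assumed to run as server requests.
        return stage >= Stage.SERVER_INDEXED
    # Desktop clients are assumed to search their own local index.
    return stage >= Stage.LOCAL_INDEXED


for stage in Stage:
    print(stage.name, searchable(stage, "web"), searchable(stage, "desktop"))
```

Under these assumptions there is a window (stages 3 and 4) in which the web client finds the text and a desktop client does not, which would match observations (A) and (B) earlier in the thread.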

I had some cases in which syncing between different Legacy clients was fast, complete and therefore reliable. EN-10 clients did not see changes in time, even though changes made in EN-10 were synced to other clients in parallel. It seems that EN-10 uses different syncing server farm(s) than Legacy and there is a lag in syncing between the server farms... If the OCR scanners run on a third class of servers, it's completely unpredictable when a note will be provided with recoIndex information.

But if all the servers are under "normal" load, none of this should be a problem. It's then only a theoretical question of seconds or a few minutes 😉.

Link to comment
35 minutes ago, AlbertR said:

it's completely unpredictable when a note will be provided with recoIndex information.

In the legacy version you know whether an image has been indexed and passed back to the local client because it tells you in the note information. In addition, you can open the enex file in a text editor and read the search words. Neither of those options is available in V10.

I still suspect the problem is something to do with older notes having seen both systems, but that is just a guess. @FrankC does not seem to be having any issues with new notes, even if created by copying the image from an old note.

Link to comment
1 hour ago, AlbertR said:

Thanks for the link - I hadn't seen it so far...
... and it confirms that indexing is done on the server side. Because there is no direct link between the EN-10 and Legacy databases on your local system, all of this is a syncing problem:

When I said that I didn't think it was a syncing problem, what I meant was that I believed syncing had taken place, but not all of the OCR/indexing information was transferred from legacy to V10. 

Anyway, I'll see if I get anywhere with my support ticket.

Link to comment

Hmm, the note representation in ENEX format seems to be different in Legacy and EN-10:

  • Legacy contains the found words in a recoIndex section within the ENEX file
  • EN-10 points to a cache entry outside the ENEX file (on the server, which in turn might be cached on the client), referenced by an en-cache: URL

By doing so, EN-10 has no need to sync the note itself after the OCR scanning servers have completed their job. The reason might be better syncing speed, because the cached recoIndex data alone takes fewer bytes...
... and it makes syncing effects more mysterious 😉

(As an aside: couldn't we also talk in German here?)

Link to comment
53 minutes ago, AlbertR said:

Hmm, note representation in ENEX format seem to be different in Legacy and EN-10

Exactly,  as I pointed out at the top of this thread. Bad news for those people who are keen to do their own backup of their notes by exporting the enex files. They are not getting a complete record. If a note was ever deleted from the server and then reimported from an enex file I wonder whether it would ever be indexed. One to test perhaps.

Link to comment
10 minutes ago, Mike P said:

If a note was ever deleted from the server and then reimported from an enex file I wonder whether it would ever be indexed. One to test perhaps.

OK I did the following

  • Took a screen shot using the V10 screen shot facility
  • Searched for words in the text. It took less than a minute for it to be indexed and searchable
  • Exported the note to my hard drive as an enex file
  • Opened the enex file in a text editor to confirm that there was no indexing data in the file
  • Deleted the note in EN
  • Went to the trash and deleted it permanently
  • imported the note from the enex file I'd created
  • The note was immediately searchable for the words I'd tried earlier

Actually found that quite surprising

Link to comment

Fine. My interpretation: once the image has been indexed on the servers, the recoIndex information is stored there "behind" a GUID:

<resource-attributes>...<source-url>
          en-cache://tokenKey...+https://www.evernote.com/shard/s333/res/...{GUID}
</source-url></resource-attributes>

So after deleting the note and re-importing it from ENEX, there is no need to re-index it because it points to already computed index data (which has not been deleted on server side).
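A minimal sketch of how one might list such references in an exported file. It assumes the `source-url` element sits inside `resource-attributes` as in the snippet above. The sample URL is a made-up placeholder, not a real cache key, and the function name `cache_backed_resources` is hypothetical.

```python
import xml.etree.ElementTree as ET


def cache_backed_resources(enex_text: str) -> list[str]:
    """Return the source-url values that use the en-cache: scheme."""
    root = ET.fromstring(enex_text.strip().encode("utf-8"))
    urls = []
    for attrs in root.iter("resource-attributes"):
        url = (attrs.findtext("source-url") or "").strip()
        if url.startswith("en-cache:"):
            urls.append(url)
    return urls


# Hypothetical, shortened export; the URL is a placeholder, not a real key.
V10_SAMPLE = """<en-export>
  <note>
    <title>Screenshot</title>
    <resource>
      <mime>image/png</mime>
      <resource-attributes>
        <source-url>
          en-cache://EXAMPLE-TOKEN+https://example.invalid/res/EXAMPLE-GUID
        </source-url>
      </resource-attributes>
    </resource>
  </note>
</en-export>"""

print(cache_backed_resources(V10_SAMPLE))
```

An export whose resources carry only these references and no word list of their own would be consistent with the re-import test above, where the note was searchable again immediately.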

This makes sense in your test case (and even in real cases where transfer/re-index time should be saved). But when and how do the servers clean up their caches, if not along with the deletion of notes?

This might be the end of our game without insider knowledge 😉. Let's wait for support's answer(s) to Frank's ticket...

 

Link to comment

Current situation is that the ticket has moved from Customer Support -> Technical Support -> Development Team (for further review).

The issue and symptoms have been confirmed and reproduced.

No date for resolution. Workaround is to use Legacy.

 

Link to comment
  • 1 month later...
  • 2 years later...

I'm experiencing search problems inside PDFs. A search in the Legacy version finds multiple words (example: "Thomas Robley") in a PDF created with Acrobat, but the same search fails in V10 and in the EN web version. I sent the note with the PDF in it to the support team and they were able to reproduce the problem. They raised a support ticket back in early March 23. They say the problem has been passed to the development team but could not offer a timeline for a solution. So, I'm stuck with the Legacy version until a fix is found.

I'm really P'd off that I'm having to use the Legacy version AND EN have hiked the prices for the Euro zone by 66%!

Link to comment
