Search Algorithm Suggestion


Leigh Riffel

I suggest that the Evernote search algorithm be tweaked in three ways.

  1. Weight notes higher if the title of the note contains all the search terms so that notes whose titles do not contain any of the search terms do not come first.
  2. Discount the weight of words based on the length of the note so that longer notes do not dominate search results.
  3. Weight a document that contains the search terms in close proximity higher so that documents using the terms but having them scattered do not supersede the more likely candidates.

These suggestions are based on my day to day use of Evernote, but here is a specific example. When I search for "Change Password" (without quotes) in Evernote, here are some of the results in order.

1. An 833-page PDF that does not contain "Change Password", but contains "Change" 338x and "Password" 55x.

2. A 116 line, 8005 character note that does not contain "Change Password", but contains "Change" 7x and "Password" 3x.

3. A 56 line, 1919 character note that does not contain "Change Password", but contains "Change" 2x and "Password" 1x.

4. A 1334 line, 126,619 character note that does not contain "Change Password", but contains "Change" 16x and "Password" 7x.

5. A 1651 line, 100,080 character note that does not contain "Change Password", but contains "Change" 14x and "Password" 31x.

6. A 1616 line, 95,046 character note that does not contain "Change Password", but contains "Change" 18x and "Password" 13x.

7. A 9 line, 445 character note titled "Change Password" and containing "Change" 5x and "Password" 4x.

...

23. A 15 line, 564 character note titled "Change Database Computer Password" and containing "Change" 7x and "Password" 6x.

In my mind, #7 and #23 should both come before #2-6. They are the only ones with both search terms in the title, and although they are short, they mention the search terms more frequently relative to their length. Their search terms also sit much closer together than in the other documents.
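The three suggestions above can be sketched as a single scoring function. This is only an illustration under stated assumptions: the names `relevance`, `min_span`, and `title_boost`, the boost value, and the way the three signals are simply added together are all hypothetical, and nothing here reflects how Evernote actually works.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_span(positions_by_term):
    """Width in words of the smallest window containing every term once."""
    events = sorted((p, t) for t, ps in positions_by_term.items() for p in ps)
    need = len(positions_by_term)
    counts, best, left = {}, None, 0
    for right_pos, term in events:
        counts[term] = counts.get(term, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == need:
            left_pos, left_term = events[left]
            width = right_pos - left_pos + 1
            if best is None or width < best:
                best = width
            counts[left_term] -= 1
            if counts[left_term] == 0:
                del counts[left_term]
            left += 1
    return best

def relevance(query, title, body, title_boost=2.0):
    """Hypothetical score: title match + term density + term proximity."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    title_words = set(tokenize(title))
    body_words = tokenize(body)
    positions = {t: [i for i, w in enumerate(body_words) if w == t]
                 for t in terms}
    # Suggestion 1: boost notes whose title contains all the search terms.
    title_score = title_boost if terms <= title_words else 0.0
    # Suggestion 2: divide hit count by note length so long notes do not win
    # just by being long.
    hits = sum(len(p) for p in positions.values())
    density = hits / max(len(body_words), 1)
    # Suggestion 3: reward terms that appear close together (1.0 = adjacent).
    if all(positions[t] for t in terms):
        proximity = len(terms) / min_span(positions)
    else:
        proximity = 0.0
    return title_score + density + proximity
```

With made-up notes, a five-word note titled "Change Password" whose body reads "To change password open settings" scores far above a thousand-word note in which "change" and "password" each appear once, hundreds of words apart.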

Link to comment
  • Level 5

If I am looking for a document that contains the phrase Change Password, I will enclose it in quotes and quickly find my answer.

I realize computers are getting faster every year, but to analyze every note for the ratio of one word to another, plus the location of the words seems to me to be putting too much of a load on the Evernote engine, especially when using Evernote on a device that is already struggling to handle the current program (like the iPod Touch).

Link to comment
  • Level 5*

This doesn't seem to have anything to do with the search algorithm (which merely determines whether a match succeeds or fails, and assigns notes that match to the result set); it's more of a result set ordering issue. Currently, Evernote only orders notes based on existing fields. Could they change this by adding a dynamically determined "search weight" field to those notes that match? Sure.

Link to comment
  • Level 5*

JB, I think the OP was just using this to provide a clear example.

What the OP is suggesting is much like how Google orders the results of a search, putting the most relevant on top. I think this could be a very useful feature. While I appreciate your concern about "putting too much of a load on the Evernote engine", I think this can be mitigated by:

  • Making "sort by relevance" just one more of the sort options.
    • If you don't choose it, then there is no extra load.
  • Computers continue to get much more powerful every year.
    • We are doing things today routinely we would not have even considered just a few years ago.

Link to comment
  • Level 5

I remember Dave Engberg saying Evernote indexes the words in a PDF when it receives the note.

Perhaps Evernote could add a "never index multiple word" ratio analysis. But if someone turns it on, I would think the indexing would slow down and create more complaints about Evernote's search. The index file would become huge if the program had to analyze all the possible two-word combinations in an 833-page PDF containing anywhere from 200,000 to 500,000 words (large-font pages to academic-type pages). Throwing in the relative position of all the words makes it even bigger. Keep in mind, this applies to just one note.

And eventually people would be clamoring for 3 and more word comparisons requiring increases approaching an order of magnitude.

I'm not a computer expert and could be all wet on this issue. Just my opinion.

Link to comment
  • Level 5*

There was a time in database searches when a full-text search was very expensive, meaning it could be very slow, and significantly increase indexing overhead. Today full-text search is commonplace.

I don't have a technical understanding of the Evernote Search engine, so I don't really know how hard it would be to add a sort by weighted relevance option. Obviously this is possible, since Google does it very well, and very fast. Add to this the fact that Evernote is re-executing your entire search after every character you type in the Search block (I wish it would not do this). But the fact that it does do it, and do it very fast, leads me to believe that adding an option for the Search Engine to sort by relevance could be achieved without a significant decrease in performance.

While I would like this search enhancement, there are other enhancements to the search engine that would be a higher priority for me, and be much more useful in general. For example: adding full support for Boolean search.

Link to comment
  • Level 5*

Search would need to do more in order to determine a value for relevance (by whatever criteria are used, whether they're the OP's or something else), but only for notes that were determined to have matched already (since notes that do not match are not relevant). Matched notes are (hopefully) a small percentage of the total number of notes, so the extra work might not be that much. So somewhere between match determination and result set ordering is where the relevance calculation must take place.

Extra credit for the OP (or the peanut gallery): how do other search criteria (e.g. todo:, date:, location:, etc., which have nothing or little to do with any location in the note) fit into the relevance determination?
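The two-stage idea described above (decide which notes match first, then compute relevance only for those) might look like the sketch below. Every name in it (`matches`, `term_count`, `search`) is hypothetical, notes are plain dictionaries, the match rule assumes a note must contain every search term, and the scoring function is a deliberately crude placeholder.

```python
def matches(note, terms):
    """Stage 1: a note matches only if its body contains every search term."""
    words = set(note["body"].lower().split())
    return all(t in words for t in terms)

def term_count(note, terms):
    """Placeholder relevance: total occurrences of the search terms."""
    words = note["body"].lower().split()
    return sum(words.count(t) for t in terms)

def search(notes, query, score=term_count):
    """Filter first, then score and order only the notes that matched."""
    terms = query.lower().split()
    matched = [n for n in notes if matches(n, terms)]
    # Stage 2 touches only the matched notes, so the extra relevance work
    # grows with the size of the result set, not the whole account.
    return sorted(matched, key=lambda n: score(n, terms), reverse=True)
```

Because `score` is just a parameter, any other relevance criteria could be swapped in without touching the match stage.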

Link to comment

Thank you jefito for bringing to light my misunderstanding of how the results are displayed. Perhaps it is the Google effect, but I was under the false impression that the results were ordered by relevance and that the sort criteria only came into play when a column was clicked on. Obviously I was mistaken.

So yes, as JMicheal said it appears that what I am asking for is that search results be ordered by relevance. Since such a thing does not exist, that easily explains why notes with the search words in the title are not at the top of the list (unless they are there because they sorted there using the existing sort criteria).

jbenson2, if the word number of each word were stored when the note was saved, ordering by the sum of the average position of each search term should be very quick. If not, it could at least sum the first position of each term for each note. If even that takes too long, it could simply move notes with the search terms in the title to the top of the list.

In my example I didn't search for "Change Password" in quotes because I didn't know the order of the terms in the saved note.
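The suggestion to jbenson2 above (store each word's number once at save time, then order by simple sums of those numbers) can be sketched as follows. The function names are hypothetical, and the sketch assumes every search term occurs at least once in the body of each note being ranked.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positions(body):
    """Run once when a note is saved: map each word to its word numbers."""
    positions = defaultdict(list)
    for number, word in enumerate(tokenize(body)):
        positions[word].append(number)
    return dict(positions)

def average_position_sum(positions, terms):
    """Sum of the average word number of each search term (lower sorts first)."""
    return sum(sum(positions[t]) / len(positions[t]) for t in terms)

def first_position_sum(positions, terms):
    """Cheaper fallback: sum of the first word number of each search term."""
    return sum(positions[t][0] for t in terms)

def title_first_key(title, terms):
    """Cheapest fallback: 0 if the title has every term, else 1."""
    title_words = set(tokenize(title))
    return 0 if all(t in title_words for t in terms) else 1
```

Each function returns a number that can be used directly as a sort key, so at search time the work per note is a few dictionary lookups and additions rather than a rescan of the text.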

Link to comment
  • Level 5

Interesting. It is amazing what tiny little pieces of silicon can do behind the scenes. Thanks.

Link to comment
  • 1 year later...

Even a very basic "sort by relevance" would be a great help and is much needed. When I try to retrieve a note titled something very simple (like the note I titled "mint.com" containing my info about my mint account), the note I want is buried within all the recipes I have that call for mint. Leigh's suggestions are more sophisticated than mine, but I would settle for a sort by relevance that just puts on top the notes for which the title matches the search, then maybe notes for which other key fields have matches (tags, etc.), then maybe matches within the body of the text (ordered by # of appearances), and finally those with matches within the scanned Word/PDF files.

Now that we're all scanning in vast numbers of PDF documents, simple searches are becoming useless.

Anticipating feedback: I understand that there are tricks I can learn, like using "intitle" for searches, to achieve what I want, but these require extra work and expertise that I don't want to need in order to use your product. I presume your goal is mass appeal, and that requires that people be able to do a simple word search and get back the note they were thinking of. A "sort by relevance" feature would do this. Please don't get bogged down in the details of creating the best algorithm. Your first draft will work 90% of the time and it will always be possible to tweak the algorithm behind the scenes later.
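The basic ordering described in this post (title matches first, then tags, then body text, then scanned attachments) is simple enough to sketch. The note fields (`title`, `tags`, `body`, `attachment_text`) and function names are assumptions made for illustration, and matching is a plain case-insensitive substring test.

```python
def tier(note, term):
    """Return the best tier in which the term appears, or None if no match."""
    term = term.lower()
    if term in note["title"].lower():
        return 0                      # title match: top of the list
    if any(term in tag.lower() for tag in note["tags"]):
        return 1                      # match in another key field (tags)
    if term in note["body"].lower():
        return 2                      # match in the body text
    if term in note["attachment_text"].lower():
        return 3                      # match only inside a scanned file
    return None

def sort_by_relevance(notes, term):
    """Order matching note titles by tier, then by number of body hits."""
    ranked = []
    for note in notes:
        t = tier(note, term)
        if t is None:
            continue
        body_hits = note["body"].lower().count(term.lower())
        # Negative hit count so that more appearances sort earlier.
        ranked.append((t, -body_hits, note["title"]))
    ranked.sort()
    return [title for _, _, title in ranked]
```

In this sketch, a search for "mint" puts a note titled "mint.com" ahead of every recipe that only mentions mint in its body, which is the behavior the post asks for.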

Link to comment

Searching for a particular note using only 'mint.com' is pretty vague (just as it would be on Google) & I suggest you use a more refined search along with keywords & descriptive titles & tags. In this case, the word mint is pretty generic, so I'd suggest you use a tag along with keywords & a descriptive title on your mint.com notes. For example, I have used a product that was originally called Neat Receipts/NeatReceipts. Then it was Neat Works/Neatworks, then Neat Desk/Neatdesk as well as just plain ol' Neat. For those notes, I apply a tag of "NeatReceipts/works/whatever". I never remembered if I put a space between the two words & then with the name changes, it was just too problematic. The tag solved this problem.

You can then refine your search by using a date range or adding a keyword. For example, I buy a lot of things from Amazon. But if I add a description of the item(s), it's easy to find the one or two notes I'm looking for. Evernote provides the tools for a powerful search in order to find one or a few notes out of tens of thousands. But you have to use them. If you say you don't want to learn or use them, then that's unfortunate. Every organizing tool requires a certain amount of discipline by the user in order to function properly. It literally takes seconds to do these things.

Good luck with finding something that better suits your needs.

Link to comment
  • Level 5*

Even a very basic "sort by relevance" would be a great help and is much needed. When I try to retrieve a note titled something very simple (like the note I titled "mint.com" containing my info about my mint account), the note I want is buried within all the recipes I have that call for mint.

Just as a point of usage, when searching on a text string that contains embedded punctuation, you should be able to enclose the string in double quotes to get a match. At least that's what my test in the Windows and the web clients showed.

In reality, technically, it's a little trickier than that, as that particular search will match on "mint" followed by any number of spaces or carriage returns or punctuation, followed by "com". But using quoted "mint.com" is going to narrow the search result space considerably.
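The matching behavior described here can be imitated with a regular expression. This is a rough sketch that mimics the description in this post, not Evernote's actual tokenizer; `phrase_pattern` is a hypothetical helper.

```python
import re

def phrase_pattern(phrase):
    """Build a regex for a quoted phrase: its words in order, separated by
    any run of spaces, line breaks, or punctuation."""
    words = re.findall(r"\w+", phrase)
    separator = r"[\W_]+"
    body = separator.join(re.escape(w) for w in words)
    # \b keeps "mint" from matching inside a longer word like "peppermint".
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = phrase_pattern("mint.com")

# Matches: the words appear in order with only separators between them.
assert pattern.search("Log in at mint.com today")
assert pattern.search("mint,\n com")

# No match: a recipe that only mentions mint, or a longer word.
assert pattern.search("Add fresh mint to the pan") is None
assert pattern.search("peppermint.com") is None
```

The recipes drop out because they contain "mint" but not "mint" immediately followed by "com", which is why the quoted search narrows the results so much.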

Link to comment

Archived

This topic is now archived and is closed to further replies.
