How do you implement a "Did you mean"?

Nlp

Nlp Problem Overview


> Possible Duplicate:
> How does the Google “Did you mean?” Algorithm work?

Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?

Nlp Solutions


Solution 1 - Nlp

Actually, what Google does is very much non-trivial and, at first, counter-intuitive. They don't do anything like checking against a dictionary; rather, they use statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.

There are several sub-problems to solve here. As a fundamental basis for anything related to Natural Language Processing statistics, there is one must-have book: Foundations of Statistical Natural Language Processing.

Concretely, to solve the problem of word/query similarity I have had good results with edit distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein, but the other variants may be worth looking into.
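To make the edit distance idea concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function names `levenshtein` and `closest` are just illustrative choices, not something from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest(word, candidates):
    """Return the candidate with the smallest edit distance to word."""
    return min(candidates, key=lambda c: levenshtein(word, c))


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("haert", "heart"))     # 2 (a swap counts as two edits)
```

Comparing a query against every known word this way is fine for a small vocabulary, but it does not scale to a large one, which is where an index comes in.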

Soundex - in my experience - is crap.

Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to use an existing full-text indexing and retrieval engine (i.e. not your database's one), of which Lucene is currently one of the best and, coincidentally, ported to many, many platforms.

Solution 2 - Nlp

Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:

http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html

http://www.norvig.com/spell-correct.html

Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google - when asked how "did you mean" is implemented, his answer is authoritative.

So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.

SOUNDEX and other guesses don't get a look in, people!
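For readers who don't want to follow the links, here is a condensed sketch in the spirit of Norvig's article: generate every string within one or two edits of the input and pick the one seen most often in a corpus. It is a paraphrase, not his exact code; `make_corrector` and the tiny corpus used below are assumptions for illustration, and a real system would count words from a large text or query log.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings that are one edit away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def make_corrector(corpus_text):
    """Build a correct(word) function from word counts in corpus_text."""
    counts = Counter(words(corpus_text))

    def known(candidates):
        return {w for w in candidates if w in counts}

    def correct(word):
        # Prefer the word itself, then 1-edit neighbours, then 2-edit ones.
        candidates = (known([word])
                      or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=lambda w: counts[w])

    return correct


correct = make_corrector("my heart will go on and on, my heart will go on")
print(correct("haert"))  # heart
print(correct("wil"))    # will
```

Swapping the corpus for your own search logs is what makes the "dictionary" dynamic.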

Solution 3 - Nlp

Check this article on Wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.

Solution 4 - Nlp

I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information in the public domain on the subject.

As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.

Instead there is a simple and rather effective principle that is also valid for all European languages. Get all the unique queries from your search logs and calculate the edit distance between all pairs of queries. Within each pair of similar queries, assume that the reference (correctly spelled) query is the one with the highest count.
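The principle above can be sketched roughly as follows. This is only an illustration under assumptions: `suggest_from_logs`, the `max_distance` cutoff of 2 and the sample counts are made up, and the brute-force all-pairs loop is quadratic, so a real log would need some blocking or indexing first.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def suggest_from_logs(query_counts, max_distance=2):
    """Map each query to the most frequent *more popular* query that is
    within max_distance edits of it. Queries with no such neighbour are
    assumed to be spelled correctly and get no suggestion."""
    suggestions = {}
    for query, count in query_counts.items():
        best = None
        for other, other_count in query_counts.items():
            if other_count <= count:
                continue  # only a more popular query can be the reference
            if edit_distance(query, other) > max_distance:
                continue
            if best is None or other_count > query_counts[best]:
                best = other
        if best is not None:
            suggestions[query] = best
    return suggestions


# Hypothetical query counts taken from a search log.
log_counts = {
    "iis7 configure ftp": 120,
    "iis7 configre ftp": 7,
    "iis7 confgure ftp": 3,
    "mac configure ftp": 80,
}
print(suggest_from_logs(log_counts))
```

Both misspelled variants end up pointing at "iis7 configure ftp", while "mac configure ftp" is too far away from any more popular query to get a suggestion.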

This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here

The paper has a great introduction but after that you will need to be knowledgeable with concepts such as the Hidden Markov Model.

Solution 5 - Nlp

I would suggest looking at SOUNDEX to find similar words in your database.

You can also access Google's own dictionary by using the Google API spelling suggestion request.

Solution 6 - Nlp

You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.

Solution 7 - Nlp

I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
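As a rough sketch of that idea (not Google's actual method, which isn't public): walk through a query log, and whenever a user follows one query with a very similar one in the same session, count that pair as a likely correction. The log format, the function name `mine_corrections` and both thresholds are assumptions, and `difflib.SequenceMatcher` from the standard library is just a convenient stand-in for "looks like a spelling fix".

```python
from collections import Counter
from difflib import SequenceMatcher


def mine_corrections(log, min_similarity=0.8, min_support=2):
    """log: iterable of (session_id, query) pairs in chronological order.
    Returns {first_query: most_common_follow_up_query}."""
    last_query = {}          # session_id -> previous query in that session
    pair_counts = Counter()  # (first, second) -> how often second followed first
    for session_id, query in log:
        previous = last_query.get(session_id)
        if previous is not None and previous != query:
            similarity = SequenceMatcher(None, previous, query).ratio()
            if similarity >= min_similarity:
                pair_counts[(previous, query)] += 1
        last_query[session_id] = query

    best = {}  # first -> (second, count)
    for (first, second), n in pair_counts.items():
        if n >= min_support and n > best.get(first, (None, 0))[1]:
            best[first] = (second, n)
    return {first: second for first, (second, _) in best.items()}


# Hypothetical log: two users fix "haert" to "heart", one moves on to
# something unrelated.
log = [
    ("s1", "haert attack"), ("s1", "heart attack"),
    ("s2", "haert attack"), ("s2", "heart attack"),
    ("s3", "haert attack"), ("s3", "cheap flights"),
]
print(mine_corrections(log))  # {'haert attack': 'heart attack'}
```

Because it only compares strings, this works for any language or character set, as the answer points out.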

Solution 8 - Nlp

Solution 9 - Nlp

I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each of those phrases, together with a new suggested search phrase, into a SQL table.

I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
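A minimal sketch of that lookup-table approach, using Python's built-in `sqlite3` so it runs anywhere. The table name `search_suggestions`, its columns and the helper names are made up for the example; the original answer does not say which database or schema was used.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (if needed) the table that maps a failed phrase to a suggestion."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_suggestions ("
        " phrase TEXT PRIMARY KEY,"
        " suggestion TEXT NOT NULL)"
    )
    return conn


def add_suggestion(conn, phrase, suggestion):
    """Record a manually chosen suggestion for a zero-result phrase."""
    conn.execute(
        "INSERT OR REPLACE INTO search_suggestions (phrase, suggestion)"
        " VALUES (?, ?)",
        (phrase.strip().lower(), suggestion),
    )
    conn.commit()


def did_you_mean(conn, phrase, results):
    """Return a suggestion only when the search itself found nothing."""
    if results:
        return None
    row = conn.execute(
        "SELECT suggestion FROM search_suggestions WHERE phrase = ?",
        (phrase.strip().lower(),),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
add_suggestion(conn, "holliday request", "holiday request")
print(did_you_mean(conn, "Holliday Request", results=[]))  # holiday request
```

The phrases are normalised to lowercase before storing and looking up, so differences in capitalisation don't cause misses.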

You might also want to look at my answer to a similar question:

Solution 10 - Nlp

If you have industry-specific terms, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
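At its simplest, such a thesaurus is a hand-maintained mapping applied to the query before searching. The sketch below uses the abbreviations from the answer; it is not how Endeca works internally, and `expand_query` is a hypothetical helper.

```python
# Hand-maintained, industry-specific abbreviations (from the example above).
THESAURUS = {
    "kt": "karat",
    "rd": "round",
    "cwt": "carat weight",
}


def expand_query(query, thesaurus=THESAURUS):
    """Replace each known abbreviation in the query with its full form."""
    return " ".join(thesaurus.get(token.lower(), token)
                    for token in query.split())


print(expand_query("14 kt rd ring"))  # 14 karat round ring
```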

Solution 11 - Nlp

Solution 12 - Nlp

Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
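For reference, a small sketch of the standard American Soundex coding: keep the first letter, map the remaining consonants to digits, collapse runs of the same digit (also across `h` and `w`), and pad to four characters. The function name is arbitrary, and databases that ship a built-in SOUNDEX may differ in edge cases.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for name."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")  # the first letter's code counts too
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        if ch not in "hw":
            # Vowels reset the previous code; h and w are transparent.
            previous = digit
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("haert"), soundex("heart"))    # H630 H630
```

Two words "sound alike" when their codes match, which is why it does well on names but says little about typos that change the first letter.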

Also check out Full-Text-Indexing, the syntax is different from Google logic, but it's very quick and can deal with similar language elements.

Solution 13 - Nlp

Soundex and "Porter stemming" (Soundex is trivial, not sure about Porter stemming).

Solution 14 - Nlp

There's something called aspell that might help: http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html

There's a Ruby gem for it, but I don't know how to talk to it from Python: http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html

Here's a quote from the Ruby implementation:

> Usage
>
> Aspell lets you check words and suggest corrections. For example:
>
>     string = "my haert wil go on"
>
>     string.gsub(/[\w']+/) do |word|
>       if !speller.check(word)
>         # word is wrong
>         puts "Possible correction for #{word}:"
>         puts speller.suggest(word).first
>       end
>     end

This outputs:

    Possible correction for haert:
    heart
    Possible correction for wil:
    Will

Solution 15 - Nlp

Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/Levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
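A simplified sketch of the k-gram idea, not the book's exact algorithm: index every vocabulary word under its character k-grams, fetch only the words that share k-grams with the misspelled query, and keep those whose Jaccard overlap is high enough. The class name, `k=2` and the `0.4` threshold are illustrative assumptions; the surviving candidates would then typically be ranked by edit distance.

```python
from collections import Counter, defaultdict


def kgrams(word, k=2):
    """Set of character k-grams of word, with '$' marking both ends."""
    padded = f"${word}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


class KGramIndex:
    def __init__(self, vocabulary, k=2):
        self.k = k
        self.postings = defaultdict(set)  # k-gram -> words containing it
        for word in vocabulary:
            for gram in kgrams(word, k):
                self.postings[gram].add(word)

    def candidates(self, query, min_jaccard=0.4):
        """Vocabulary words whose k-gram sets overlap the query's,
        best overlap first (ties broken alphabetically)."""
        query_grams = kgrams(query, self.k)
        shared = Counter()
        for gram in query_grams:
            for word in self.postings.get(gram, ()):
                shared[word] += 1
        scored = []
        for word, common in shared.items():
            union = len(query_grams) + len(kgrams(word, self.k)) - common
            jaccard = common / union
            if jaccard >= min_jaccard:
                scored.append((jaccard, word))
        return [word for _, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


index = KGramIndex(["configure", "configuration", "compare", "ubuntu", "border"])
print(index.candidates("configre"))  # ['configure']
```

Only words that share at least one k-gram with the query are ever scored, which is what avoids comparing against the whole vocabulary.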

Solution 16 - Nlp

You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram

Using python ngram module: http://packages.python.org/ngram/index.html

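import ngram

# Index the known queries as sets of character n-grams.
G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

# search() yields (item, similarity) pairs above the threshold, best match first.
print("Similarity", "\t", "String")
for item, similarity in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print(similarity, "\t", item)

You get (similarity values shortened):

Similarity 	String
0.76	"iis7 configure ftp 7.5"
0.24	"mac configure ftp"
0.19	"ubunto configre 8.5"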

Solution 17 - Nlp

Why not use Google's "did you mean" in your code? To see how, look here: http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | pek | View Question on Stackoverflow |
| Solution 1 - Nlp | Boris Terzic | View Answer on Stackoverflow |
| Solution 2 - Nlp | Will | View Answer on Stackoverflow |
| Solution 3 - Nlp | Ionut Anghelcovici | View Answer on Stackoverflow |
| Solution 4 - Nlp | Costas Boulis | View Answer on Stackoverflow |
| Solution 5 - Nlp | Espo | View Answer on Stackoverflow |
| Solution 6 - Nlp | FA. | View Answer on Stackoverflow |
| Solution 7 - Nlp | Liam | View Answer on Stackoverflow |
| Solution 8 - Nlp | robaker | View Answer on Stackoverflow |
| Solution 9 - Nlp | GateKiller | View Answer on Stackoverflow |
| Solution 10 - Nlp | oglester | View Answer on Stackoverflow |
| Solution 11 - Nlp | cherouvim | View Answer on Stackoverflow |
| Solution 12 - Nlp | Keith | View Answer on Stackoverflow |
| Solution 13 - Nlp | Michael Neale | View Answer on Stackoverflow |
| Solution 14 - Nlp | Vishu | View Answer on Stackoverflow |
| Solution 15 - Nlp | Fabian Steeg | View Answer on Stackoverflow |
| Solution 16 - Nlp | hugo24 | View Answer on Stackoverflow |
| Solution 17 - Nlp | Narendra Rajput | View Answer on Stackoverflow |