RE: Spelling Corrector Algorithm

Husain, Yavar Thu, 01 Mar 2012 09:42:42 -0800

Thanks James. I loved the last line in your mail: "But in the end, especially 
with 1-word queries, I doubt even the best algorithms are going to always 
accurately guess what the user wanted." I absolutely agree with this; if the 
query is a phrase (instead of a single word), then we can probably apply some 
NLP techniques.
________________________________________
From: Dyer, James [james.d...@ingrambook.com]
Sent: Thursday, March 01, 2012 9:29 PM
To: solr-user@lucene.apache.org
Subject: RE: Spelling Corrector Algorithm


Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and then did you pick "Market" because you thought 
higher frequency is better?  If so, you probably have the right settings 
already but need to trust it and go with the first result.

If, on the other hand, the wrong suggestions truly are coming up first, you 
have 2 extension points:

1. You can change the comparator class.  The default one sorts by "score" 
(distance) first and then uses "frequency" to break ties.  There is also a 
pre-packaged comparator that sorts on frequency alone, or you can write your 
own (implementing Comparator<org.apache.lucene.search.spell.SuggestWord>).  But 
I doubt you'd want to change this one.  (See 
http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29
 for more info.)

2. You can change the distance metric.  The default uses Levenshtein distance, 
but there is also an implementation of Jaro-Winkler distance.  (See the 
Wikipedia articles on these two if you want to know the subtle differences.)  
It seems to me that Jaro-Winkler might give you better results, but you'd 
have to test.  See the example under 
http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration
 for more information on how to configure this.

If neither distance measure works for you, you could try implementing your own 
by creating a class implementing 
"org.apache.lucene.search.spell.StringDistance", then specify your class for 
the "distanceMeasure" parameter.
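
To make the difference between the two metrics concrete, here is a minimal 
standalone sketch in Python. It is not the Lucene code and does not use the 
Lucene API: the function names, the lowercased inputs, and the conversion of 
Levenshtein distance to a 0-1 similarity (1 - distance / longer length) are 
assumptions made for illustration, and the Jaro-Winkler version is the textbook 
formula with a prefix weight of 0.1. Exact scores from Solr may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute (no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by the shared prefix (at most 4 characters)."""
    base = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


def rank(query, candidates, similarity):
    """Order (word, freq) pairs by similarity, then frequency to break ties."""
    return sorted(candidates,
                  key=lambda wf: (-similarity(query, wf[0]), -wf[1]))


# Frequencies taken from the example in this thread.
candidates = [("marine", 120), ("market", 900)]
by_levenshtein = rank("marien", candidates, levenshtein_similarity)
by_jaro_winkler = rank("marien", candidates, jaro_winkler)
print(by_levenshtein)
print(by_jaro_winkler)
```

Under these assumptions both candidates are 2 Levenshtein edits away from 
"marien", so the frequency tie-break puts "market" first. Jaro-Winkler treats 
the swapped "e" and "n" as a single transposition and rewards the longer shared 
prefix, so "marine" ranks first regardless of frequency.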

Finally, there are some other possibly easy solutions to your problem you 
should test before going through the trouble of writing custom code:

1. Try a higher "spellcheck.count".  Even if you only want a couple of results, 
the algorithm works better with this set above 5 (10-20 might be optimal in 
some cases).
2. Use DirectSolrSpellChecker, if on 4.x.  This one is not influenced by 
"spellcheck.count", so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to 
maybe 5 or 10.  This will try the various suggestions by querying the index 
along with any other query parameters (other keywords, filters, etc), letting 
you know which suggestions are going to truly return hits in context (and how 
many).
4. Try Jaro-Winkler (as mentioned above).
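
As a rough illustration of options 1 and 3, the request parameters could be 
assembled like this in Python. The host, port, and handler path are 
hypothetical, and the sketch assumes the spellcheck component is attached to 
that handler; the specific values (10 and 5) are just picks from the ranges 
suggested above.

```python
from urllib.parse import urlencode

# Hypothetical host and handler path; adjust to your own deployment.
BASE_URL = "http://localhost:8983/solr/select"

params = {
    "q": "marien",
    "spellcheck": "true",
    "spellcheck.count": 10,                 # option 1: ask for more candidates
    "spellcheck.extendedResults": "true",   # include frequencies in the response
    "spellcheck.collate": "true",           # option 3: build collated queries
    "spellcheck.maxCollationTries": 5,      # option 3: test collations against the index
}

url = BASE_URL + "?" + urlencode(params)
print(url)
```

Any other query parameters you normally send (filters, extra keywords) would go 
in the same dictionary, since the collation tries are run in that context.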

Hope this helps.  But in the end, especially with 1-word queries, I doubt even 
the best algorithms are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Husain, Yavar [mailto:yhus...@firstam.com]
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For the spell checking component, I set extendedResults to get the frequencies 
and then select the word with the highest frequency. I understand that the 
spell check algorithm is based on edit distance. For example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. 
My dictionary here is based on indexed words.

So I chose Market (higher frequency), which is wrong, as my intent was Marine. 
Both have an edit distance of 2.

How can I improve this algorithm to select Marine instead of Market (based on 
something more than edit distance and frequency)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector algorithm (which is great), 
but again I ran into the same problems.
</PRE>
<PRE>
