Re: spellcheck: issues

Grant Ingersoll Wed, 08 Oct 2008 12:06:10 -0700


On Oct 8, 2008, at 2:03 PM, Jason Rennie wrote:

On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll<[EMAIL PROTECTED]> wrote:
Token: chane OMP: false
Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=1 status=0 QTime=1
No Suggestions
The result here seems wrong to me. Shouldn't it suggest "chanel"? You also
tried this same query with OMP=true and it suggested "chanel". Maybe I'm
not understanding the purpose of OMP? Shouldn't OMP=false return at least
as many suggestions as OMP=true?

"chane" is in the dictionary. For better or worse, Lucene skips words that are in the dictionary when OMP is false.
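To make that behavior concrete, here is a minimal Python sketch of the rule as described in this thread. It is not Lucene's code: the function name `suggest`, the `freqs` map, and the exact OMP=true comparison (drop candidates less frequent than the query term) are assumptions for illustration only.

```python
def suggest(word, freqs, candidates, only_more_popular):
    """Hypothetical sketch of the OMP rule discussed above (not Lucene's API).

    freqs maps each dictionary term to its frequency; a term that is
    absent from freqs is treated as not being in the dictionary.
    """
    word_freq = freqs.get(word, 0)

    # OMP=false: a word that is already in the dictionary is treated as
    # correctly spelled, so nothing is suggested for it.
    if not only_more_popular and word_freq > 0:
        return []

    suggestions = []
    for cand in candidates:
        if cand == word:
            continue
        # OMP=true (assumed rule): skip candidates that are less frequent
        # than the query term itself.
        if only_more_popular and freqs.get(cand, 0) < word_freq:
            continue
        suggestions.append(cand)
    return suggestions


# Frequencies taken from the log output in this thread.
FREQS = {"chane": 1, "chanel": 834}
```

With these frequencies, `suggest("chane", FREQS, ["chanel"], False)` returns nothing, while the same call with `True` returns "chanel", which matches the two results reported above.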

Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
hits=0 status=0 QTime=2
Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand,
chan, chair]
      Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
      Num Found 10

------
1) Is this an accurate representation of what you are trying to convey?
Yes.
2) In light of this shared code that I hope captures both the document side
and the query side, is the issue then highlighted by the last result above,
namely, that "chan" sorts after "chand" even though "chan" has a higher
frequency?
I highlighted another issue above, but yes, the fact that "chan" sorts below
other single-edit terms with much lower frequencies seems like an issue to
me. The Lucene SpellChecker page suggests a logical explanation: terms are
first sorted by the FuzzyQuery score (normalized edit distance), then by
popularity. I'm wondering whether it would be better to sort by a single,
combined score, such as:
NewSPerScore = (edit distance) * (suggestion term length) / (original term
length) + log_1000(frequency)
Sorting according to this score would encourage longer suggestions, but not
at the expense of shorter, popular suggestions. Might need to be tweaked
further, but I'd guess that it would do better than the two-step sort.
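As a rough way to compare the two orderings, here is a Python sketch. Several things in it are assumptions rather than Lucene's implementation: "edit distance" in the formula is read as the normalized similarity score (higher is better), the similarity is taken to be 1 - distance / (length of the shorter term), and all function names are made up for illustration. The candidate list and frequencies are copied from the "chanl" log above.

```python
import math

# (suggestion, frequency) pairs from the "chanl" log output in this thread.
CANDIDATES = [
    ("chanel", 834), ("chant", 10), ("chang", 8), ("chani", 4), ("chana", 1),
    ("chane", 1), ("charl", 1), ("chand", 1), ("chan", 106), ("chair", 1950),
]


def levenshtein(a, b):
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(original, suggestion):
    """Assumed normalization: 1 - distance / length of the shorter term."""
    shorter = max(1, min(len(original), len(suggestion)))
    return 1.0 - levenshtein(original, suggestion) / shorter


def two_step_order(original, candidates):
    """Sort by similarity first, then by frequency (both descending)."""
    return sorted(candidates,
                  key=lambda c: (-similarity(original, c[0]), -c[1]))


def combined_score(original, suggestion, frequency):
    """The single combined score proposed above; frequency must be >= 1."""
    length_ratio = len(suggestion) / len(original)
    return (similarity(original, suggestion) * length_ratio
            + math.log(frequency, 1000))


def combined_order(original, candidates):
    """Sort by the combined score, best first."""
    return sorted(candidates,
                  key=lambda c: combined_score(original, c[0], c[1]),
                  reverse=True)
```

Under these assumptions, `two_step_order("chanl", CANDIDATES)` reproduces the order in the log, while `combined_order("chanl", CANDIDATES)` moves "chan" up to third, ahead of the low-frequency single-edit terms. It also lifts "chair" (two edits, frequency 1950) to second, which is probably the kind of effect that would need tweaking.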

Makes sense to me. I could see the SpellChecker being modified (in Lucene) to provide alternate scoring/sorting. Right now, you can use other distance measures as well, so you could codify your idea and try it out to see if it is better (and then donate it!). You might try the Jaro-Winkler measure, too, as it is a bit more sophisticated than Levenshtein when it comes to scoring.
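For reference, here is a short Python sketch of the Jaro-Winkler similarity as it is usually defined. It is a standalone illustration, not the Lucene class; the function names and the default prefix scale of 0.1 with a 4-character prefix cap are the conventional textbook choices and are assumptions here.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

Because the prefix boost rewards candidates that share their first letters with the query, it tends to favor suggestions such as "chanel" or "chan" for "chanl" over terms that differ early in the word.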
