On Oct 8, 2008, at 2:03 PM, Jason Rennie wrote:

On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Token: chane OMP: false
Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=1 status=0 QTime=1
No Suggestions


The result here seems wrong to me. Shouldn't it suggest "chanel"? You also tried this same query with OMP=true and it suggested "chanel". Maybe I'm not understanding the purpose of OMP? Shouldn't OMP=false return at least as many suggestions as OMP=true?

chane is in the dictionary. For better or worse, Lucene skips words that are in the dictionary when OMP is false.




Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
hits=0 status=0 QTime=2
Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand,
chan, chair]
      Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
      Num Found 10

------

1) Is this an accurate representation of what you are trying to convey?


Yes.

2) In light of this shared code that I hope captures both the document side and the query side, is the issue then highlighted by the last result above, namely, that "chan" sorts after "chand" even though "chan" has a higher frequency?


I highlighted another issue above, but yes, the fact that "chan" sorts below other single-edit terms with much lower frequencies seems like an issue to me. The Lucene SpellChecker page suggests a logical explanation: terms are first sorted by the FuzzyQuery score (normalized edit distance), then by popularity. I'm wondering whether it would be better to sort by a single,
combined score, such as:

NewSPerScore = (edit distance) * (suggestion term length) / (original term
length) + log_1000(frequency)

Sorting according to this score would encourage longer suggestions, but not at the expense of shorter, popular suggestions. It might need to be tweaked further, but I'd guess that it would do better than the two-step sort.
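
As a rough illustration of the combined score proposed above, here is a minimal Python sketch applied to the "chanl" suggestions and frequencies from the log. It is not Lucene code. It assumes that "edit distance" in the formula means a normalized similarity in [0, 1] where higher is better, computed here as 1 - (Levenshtein distance / longer term length); that reading is an assumption, as are the helper names levenshtein, similarity, combined_score and rank_suggestions.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical terms, lower for more edits."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def combined_score(original: str, suggestion: str, frequency: int) -> float:
    """similarity * (suggestion length / original length) + log_1000(frequency).

    Assumes frequency >= 1, since the logarithm is undefined at 0.
    """
    length_ratio = len(suggestion) / len(original)
    return similarity(original, suggestion) * length_ratio + math.log(frequency, 1000)


def rank_suggestions(original: str, freqs: dict) -> list:
    """Return suggestions sorted by descending combined score."""
    return sorted(freqs,
                  key=lambda term: combined_score(original, term, freqs[term]),
                  reverse=True)


# Suggestions and frequencies copied from the "chanl" log output.
FREQS = {"chanel": 834, "chant": 10, "chang": 8, "chani": 4, "chana": 1,
         "chane": 1, "charl": 1, "chand": 1, "chan": 106, "chair": 1950}

ranked = rank_suggestions("chanl", FREQS)
for term in ranked:
    print(term, round(combined_score("chanl", term, FREQS[term]), 4))
```

Under these assumptions "chan" moves ahead of the frequency-1 terms, while the very frequent "chair" also climbs despite being two edits away, which is the kind of trade-off that might need the further tweaking mentioned above.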


Makes sense to me. I could see the SpellChecker being modified (in Lucene) to provide alternate scoring/sorting. Right now, you can use other distance measures as well, so you could codify your idea and try it out to see if it is better (and then donate it!). You might try the Jaro-Winkler measure, too, as it is a bit more sophisticated than Levenshtein when it comes to scoring.
