On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Token: chane OMP: false
> Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
> hits=1 status=0 QTime=1
> No Suggestions

The result here seems wrong to me. Shouldn't it suggest "chanel"? You also
tried this same query with OMP=true and it suggested "chanel". Maybe I'm not
understanding the purpose of OMP? Shouldn't OMP=false return at least as many
suggestions as OMP=true?

> Token: chanl OMP: false
> Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
> hits=0 status=0 QTime=2
> Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand,
> chan, chair]
> Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
> Num Found 10
>
> ------
>
> 1) Is this an accurate representation of what you are trying to convey?

Yes.

> 2) In light of this shared code that I hope captures both the document side
> and the query side, is the issue than highlighted by the last result above,
> namely, that "chan" sorts after "chand" even though "chan" has a higher
> frequency?

I highlighted another issue above, but yes, the fact that "chan" sorts below
other single-edit terms with much lower frequencies seems like an issue to me.

The Lucene SpellChecker page suggests a logical explanation: terms are first
sorted by the FuzzyQuery score (normalized edit distance), then by popularity.
I'm wondering whether it would be better to sort by a single, combined score,
such as:

    NewSPerScore = (edit distance) * (suggestion term length) / (original term length) + log_1000(frequency)

Sorting according to this score would encourage longer suggestions, but not at
the expense of shorter, popular suggestions.
Might need to be tweaked further, but I'd guess that it would do better than
the two-step sort.

Cheers,
Jason
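
A rough sketch of the comparison described above, written in Python for brevity rather than against the Lucene API. It is built on assumptions that are not stated in the thread: the "edit distance" term in the proposed formula is read as a normalized similarity (higher is better), computed as 1 - distance / min(term lengths), and the helper names (`levenshtein`, `similarity`, `combined_score`) are made up for illustration. With that normalization, the two-step sort happens to reproduce the order of the "chanl" suggestions quoted above, which is the only reason it was chosen here.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(original: str, suggestion: str) -> float:
    # ASSUMPTION: normalized similarity in [0, 1], higher is better.
    # This normalization is a guess, not taken from the Lucene source.
    shortest = min(len(original), len(suggestion))
    return 1.0 - levenshtein(original, suggestion) / shortest


def combined_score(original: str, suggestion: str, freq: int) -> float:
    # The proposed single score: similarity scaled by the length ratio,
    # plus log base 1000 of the document frequency (freq must be >= 1).
    length_ratio = len(suggestion) / len(original)
    return similarity(original, suggestion) * length_ratio + math.log(freq, 1000)


original = "chanl"
# Suggestions and frequencies copied from the quoted log output.
words = ["chanel", "chant", "chang", "chani", "chana",
         "chane", "charl", "chand", "chan", "chair"]
freqs = [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
candidates = list(zip(words, freqs))

# Two-step sort: similarity first, frequency only as a tie-breaker.
two_step = [w for w, f in sorted(
    candidates, key=lambda c: (-similarity(original, c[0]), -c[1]))]

# Single combined score, highest first.
combined = [w for w, f in sorted(
    candidates, key=lambda c: -combined_score(original, c[0], c[1]))]

print("two-step:", two_step)
print("combined:", combined)
```

Under these assumptions "chanel" stays on top with the combined score, while "chair" and "chan" move ahead of the low-frequency single-edit terms such as "chand". Whether that is the desired behavior for "chair" (two edits away, but very frequent) is exactly the kind of thing the weights would need tuning for.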