Re: spellcheck: issues

Jason Rennie Wed, 08 Oct 2008 11:11:20 -0700

On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


> Token: chane OMP: false
> Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
> hits=1 status=0 QTime=1
> No Suggestions


The result here seems wrong to me.  Shouldn't it suggest "chanel"?  You also
tried this same query with OMP=true and it suggested "chanel".  Maybe I'm
not understanding the purpose of OMP?  Shouldn't OMP=false return at least
as many suggestions as OMP=true?

> Token: chanl OMP: false
> Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
> INFO: [spell] webapp=null path=/select
> params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
> hits=0 status=0 QTime=2
>        Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand,
> chan, chair]
>        Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
>        Num Found 10
>
> ------
>
> 1)  Is this an accurate representation of what you are trying to convey?


Yes.

> 2)  In light of this shared code that I hope captures both the document side
> and the query side, is the issue then highlighted by the last result above,
> namely, that "chan" sorts after "chand" even though "chan" has a higher
> frequency?


I highlighted another issue above, but yes, the fact that "chan" sorts below
other single-edit terms with much lower frequencies seems like an issue to
me.  The Lucene SpellChecker page suggests a logical explanation: terms are
first sorted by the FuzzyQuery score (normalized edit distance), then by
popularity.  I'm wondering whether it would be better to sort by a single,
combined score, such as:

NewSPerScore = (edit distance) * (suggestion term length) / (original term
length) + log_1000(frequency)

Sorting according to this score would encourage longer suggestions, but not
at the expense of shorter, popular suggestions.  Might need to be tweaked
further, but I'd guess that it would do better than the two-step sort.
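
As a rough sketch of the combined score (not run against Solr or Lucene),
the snippet below ranks the "chanl" suggestions from the log above.  It rests
on assumptions that are not spelled out in the formula: "(edit distance)" is
read as a FuzzyQuery-style similarity, 1 - distance / (longer term length),
so that higher is better; frequencies are at least 1; and the function names
are made up for illustration.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(original: str, suggestion: str) -> float:
    """ASSUMPTION: normalized similarity in [0, 1], higher is closer."""
    longest = max(len(original), len(suggestion))
    return 1.0 - levenshtein(original, suggestion) / longest


def combined_score(original: str, suggestion: str, freq: int) -> float:
    """similarity * (suggestion length / original length) + log_1000(freq).

    ASSUMPTION: freq >= 1 and original is non-empty.
    """
    length_ratio = len(suggestion) / len(original)
    return similarity(original, suggestion) * length_ratio + math.log(freq, 1000)


# Terms and frequencies copied from the "chanl" log output above.
original = "chanl"
terms = ["chanel", "chant", "chang", "chani", "chana",
         "chane", "charl", "chand", "chan", "chair"]
freqs = [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]

# Single-pass sort on the combined score, best first.
ranked = sorted(zip(terms, freqs),
                key=lambda tf: combined_score(original, tf[0], tf[1]),
                reverse=True)

for term, freq in ranked:
    print(f"{term:8s} freq={freq:5d} score={combined_score(original, term, freq):.3f}")
```

Under these assumptions "chanel" stays first and "chan" (106) moves ahead of
"chand" (1).  "chair" (1950) also climbs despite being two edits away, which
is the kind of thing the weighting would need tweaking for.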

Cheers,

Jason
