On Oct 8, 2008, at 2:03 PM, Jason Rennie wrote:
On Wed, Oct 8, 2008 at 1:24 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Token: chane OMP: false
Oct 8, 2008 1:19:56 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achane&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=1}
hits=1 status=0 QTime=1
No Suggestions
The result here seems wrong to me. Shouldn't it suggest "chanel"? You
also tried this same query with OMP=true and it suggested "chanel".
Maybe I'm not understanding the purpose of OMP? Shouldn't OMP=false
return at least as many suggestions as OMP=true?
chane is in the dictionary. For better or worse, Lucene skips words
that are in the dictionary when OMP is false.
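
As a rough illustration of the behavior described above, here is a toy model in Python. It is not Lucene's code: the function `suggest`, the `FREQS` table, and the pre-computed candidate list are all hypothetical, and the frequencies are simply copied from the log output in this thread.

```python
# Toy model of the onlyMorePopular (OMP) behavior discussed in this thread.
# ASSUMPTION: this is a simplification for illustration, not Lucene's logic.

# Term frequencies copied from the log output in this thread.
FREQS = {"chanel": 834, "chant": 10, "chang": 8, "chani": 4, "chana": 1,
         "chane": 1, "charl": 1, "chand": 1, "chan": 106, "chair": 1950}


def suggest(word, candidates, only_more_popular):
    """Return the candidates that survive the OMP rule for `word`."""
    word_freq = FREQS.get(word, 0)
    if not only_more_popular and word_freq > 0:
        # The word is in the dictionary, so it is treated as correctly
        # spelled and nothing is suggested.
        return []
    # With OMP on, a candidate must be strictly more frequent than the word.
    threshold = word_freq if only_more_popular else 0
    return [c for c in candidates if c != word and FREQS.get(c, 0) > threshold]


print(suggest("chane", ["chanel"], only_more_popular=False))
print(suggest("chane", ["chanel"], only_more_popular=True))
```

Under this model "chane" gets no suggestion with OMP=false because it is itself an indexed term, while OMP=true still offers the more frequent "chanel".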
Token: chanl OMP: false
Oct 8, 2008 1:19:57 PM org.apache.solr.core.SolrCore execute
INFO: [spell] webapp=null path=/select
params={q=description%3Achanl&spellcheck=true&spellcheck.onlyMorePopular=false&spellcheck.extendedResults=true&spellcheck.count=10}
hits=0 status=0 QTime=2
Sugg[0]: [chanel, chant, chang, chani, chana, chane, charl, chand, chan, chair]
Sugg[0] Freqs: [834, 10, 8, 4, 1, 1, 1, 1, 106, 1950]
Num Found 10
------
1) Is this an accurate representation of what you are trying to convey?
Yes.
2) In light of this shared code that I hope captures both the document
side and the query side, is the issue then highlighted by the last
result above, namely, that "chan" sorts after "chand" even though
"chan" has a higher frequency?
I highlighted another issue above, but yes, the fact that "chan" sorts
below other single-edit terms with much lower frequencies seems like an
issue to me. The Lucene SpellChecker page suggests a logical
explanation: terms are first sorted by the FuzzyQuery score (normalized
edit distance), then by popularity. I'm wondering whether it would be
better to sort by a single, combined score, such as:
NewSPerScore = (edit distance) * (suggestion term length)
               / (original term length) + log_1000(frequency)
Sorting according to this score would encourage longer suggestions, but
not at the expense of shorter, popular suggestions. Might need to be
tweaked further, but I'd guess that it would do better than the
two-step sort.
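
To make the proposal concrete, here is a minimal Python sketch that applies the combined score to the ten suggestions and frequencies from the "chanl" log output above. Two things are assumptions rather than anything stated in the thread: "edit distance" is read as a normalized similarity, 1 - (Levenshtein edits / longer term length), so that higher is better; and the helper names (`levenshtein`, `similarity`, `combined_score`) are made up for illustration.

```python
import math


def levenshtein(a, b):
    """Plain Levenshtein edit count between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(original, suggestion):
    # ASSUMPTION: "edit distance" in the formula means a normalized
    # similarity in [0, 1], where 1.0 is an exact match.
    longest = max(len(original), len(suggestion))
    return 1.0 - levenshtein(original, suggestion) / longest


def combined_score(original, suggestion, frequency):
    """similarity * (suggestion length / original length) + log_1000(freq)."""
    length_ratio = len(suggestion) / len(original)
    return similarity(original, suggestion) * length_ratio + math.log(frequency, 1000)


# Suggestions and frequencies copied from the "chanl" log output.
SUGGESTIONS = {"chanel": 834, "chant": 10, "chang": 8, "chani": 4, "chana": 1,
               "chane": 1, "charl": 1, "chand": 1, "chan": 106, "chair": 1950}

ranked = sorted(SUGGESTIONS,
                key=lambda term: combined_score("chanl", term, SUGGESTIONS[term]),
                reverse=True)
for term in ranked:
    print(term, round(combined_score("chanl", term, SUGGESTIONS[term]), 3))
```

Under these assumptions "chanel" stays on top and "chan" moves ahead of the low-frequency single-edit terms such as "chand". The very frequent "chair", which is two edits away, also rises to second place, which is the sort of effect the weighting might need tweaking for.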
Makes sense to me. I could see the Spellchecker being modified (in
Lucene) to provide alternate scoring/sorting. Right now, you can use
other distance measures, as well, so you could codify your idea and
try it out to see if it is better (and then donate it!)
You might try the Jaro-Winkler measure, too, as it is a bit more
sophisticated than Levenshtein when it comes to scoring.
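
For reference, here is a small Python sketch of the textbook Jaro-Winkler similarity, to show how it scores candidates for "chanl". It is an illustration only and may differ in detail from Lucene's implementation; the function names and the prefix weight `p=0.1` (the commonly cited default) are assumptions.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


for candidate in ("chanel", "chan", "chair"):
    print(candidate, round(jaro_winkler("chanl", candidate), 4))
```

Because the measure rewards a shared prefix, "chanel" (which shares "chan" with the query) scores well above "chair" here.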