On Oct 6, 2008, at 6:10 PM, Jason Rennie wrote:
I've been using spellcheck.count=10 since that seems to yield a much
better top result than using the default count of 1. However, I'm still
seeing weird cases. Here are a few queries with returned suggestions.
Frequency counts are in parentheses.
- query is "candyz". Suggestions are: 1. "candyâ" (1), 2. "candy" (965),
... #2 is vastly more popular than #1 and involves the same # of edits.
Why would it order suggestions this way?
I'm guessing the edit distance is less????
- query is "yellw". Suggestions are: 1. "yellow" (2880), 2. "yello" (2),
3. "yelow" (1), 4. "yell" (74), ... Shouldn't "yell" come before "yello"
and "yelow" due to the higher frequency?
Again, probably b/c of the distance. What distance measure are you using?
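To make the distance question concrete, here is a minimal Python sketch of the ordering Jason expects: sort by plain Levenshtein edit distance, then break ties by frequency. It is an illustration only, not the spellchecker's actual code. The function names (`edit_distance`, `normalized_similarity`, `rank`) are hypothetical, and the length-normalized score is just one possible measure, not one this thread confirms is in use. The frequencies are the ones reported above.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical length-normalized score in [0, 1]; higher is closer."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def rank(query, candidates):
    """Order (word, frequency) pairs by distance, then by higher frequency."""
    return sorted(candidates, key=lambda c: (edit_distance(query, c[0]), -c[1]))


# (word, frequency) pairs as reported in the thread for the query "yellw".
yellw = [("yellow", 2880), ("yello", 2), ("yelow", 1), ("yell", 74)]
ranked = rank("yellw", yellw)
for word, freq in ranked:
    print(word, freq, edit_distance("yellw", word),
          round(normalized_similarity("yellw", word), 3))
```

All four candidates are one edit away from "yellw", so under this ordering frequency alone decides and "yell" lands ahead of "yello" and "yelow". A length-normalized score would favor "yellow" (the longer word) but would still leave the other three tied, so on its own it would not explain the order reported above.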
- query is "yello". 53 document hits. No suggestions. "yellow" yields
36560 documents. Does the spellchecker only run when there are no
document hits?
No, it should run in both cases. Can you reproduce in a small test case?
Btw, is there a better place to be posting comments/questions like this?
Possibly, but here's the place to start. They may be Lucene spellchecker
issues, but let's diagnose here first and then move there if needed.
Jason
On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:
I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...
1. Rebuild breaks requests - I'm using rebuildOnCommit ATM. If a commit
is going on and files are being rebuilt in the spellcheck data dir,
spellcheck requests yield bogus answers. I.e. I can issue identical
requests and get drastically different answers. The first time, I get
suggestions and "correctlySpelled" is false. The second time (during
the commit), I get no suggestions and "correctlySpelled" is true.
Shouldn't spellcheck use the old index until the new one is ready for
use, like Solr does with optimizes?
2. Inconsistent ordering - The first suggestion changes depending on
the spellcheck.count that I specify. If my query is "chanl" and I ask
for one result, the suggestion is "chant" (freq. 16). If I ask for 5
results, the first suggestion is also "chant"; the other 4 suggestions
are less frequent (e.g. "chang", freq. 11). However, if I ask for 10
results, the first suggestion is "chanel" (freq. 1296); #2 and #3 are
"chant" and "chang"; #9 is "chan" (freq. 174). Shouldn't spellcheck
return the best suggestion first? In my case, shouldn't "chanel" always
top "chant" and "chang" since they all have the same edit distance yet
"chanel" is two orders of magnitude more popular?
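One way the first suggestion could depend on the requested count is sketched below in Python. The assumption, which nothing in this thread confirms, is that candidates are first fetched as a pool whose size grows with the count and that only this pool is re-ranked. The `suggest` function, the `pool_factor` parameter, and the first-pass hit order are all hypothetical; the words and frequencies are the ones from item 2. The sketch is not meant to reproduce the exact order reported there, only the effect of the top suggestion changing.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(query, first_pass_hits, count, pool_factor=2):
    """Re-rank only the first pool_factor * count hits (hypothetical scheme).

    first_pass_hits: (word, frequency) pairs in the order some earlier
    lookup returned them; that order is an assumption for illustration.
    """
    pool = first_pass_hits[: pool_factor * count]
    ranked = sorted(pool, key=lambda c: (edit_distance(query, c[0]), -c[1]))
    return ranked[:count]


# Assumed first-pass order; frequencies are those quoted in the thread.
hits = [("chant", 16), ("chang", 11), ("chanel", 1296), ("chan", 174)]

top_for_one = suggest("chanl", hits, count=1)  # pool holds only 2 hits
top_for_two = suggest("chanl", hits, count=2)  # pool holds all 4 hits
print(top_for_one)
print(top_for_two)
```

With a count of 1 the pool never contains "chanel", so "chant" comes out on top; with a count of 2 the pool is large enough to include "chanel", which then wins on frequency. If something like this is happening, requesting a larger count and keeping only the first suggestion would be a workaround, which matches what Jason observed with spellcheck.count=10.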
Is there anything I could be doing wrong to create these problems? If
not, are these known issues? If not, should I create JIRA issues for them?
Thanks,
Jason
--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ