On Oct 6, 2008, at 6:10 PM, Jason Rennie wrote:

I've been using spellcheck.count=10 since that seems to yield a much better top result than using the default count of 1. However, I'm still seeing weird cases. Here are a few queries with returned suggestions. Frequency
counts are in parentheses.

- query is "candyz". Suggestions are: 1. "candyâ" (1), 2. "candy" (965), ... #2 is vastly more popular than #1 and involves the same # of edits.
  Why would it order suggestions this way?

I'm guessing the edit distance is smaller?



- query is "yellw". Suggestions are: 1. "yellow" (2880), 2. "yello" (2), 3. "yelow" (1), 4. "yell" (74), ... Shouldn't "yell" come before "yello"
  and "yelow" due to the higher frequency?

Again, probably because of the distance. What distance measure are you using?
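
To make the distance question concrete, here is a minimal sketch of how a length-normalized edit distance could produce the orderings reported above even though every candidate is one edit away from the query. The normalization used here (1 - distance / length of the shorter word) and the tie-break on frequency are assumptions made for illustration, not a confirmed description of what the spellchecker does; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(query: str, candidate: str) -> float:
    # ASSUMPTION: similarity is normalized by the shorter word's length.
    return 1.0 - levenshtein(query, candidate) / min(len(query), len(candidate))


def rank(query: str, freqs: dict) -> list:
    # Sort by similarity first; frequency only breaks ties (also an assumption).
    return sorted(freqs, key=lambda w: (-score(query, w), -freqs[w]))


# Frequencies are the ones quoted in the message above.
print(rank("yellw", {"yell": 74, "yelow": 1, "yello": 2, "yellow": 2880}))
```

Under these assumptions "yellow", "yello" and "yelow" all score 0.8 and are ordered by frequency, while "yell" scores 0.75 and lands last despite its higher frequency, which matches the order in the report. The same scoring puts a six-character one-substitution match ahead of "candy" for the query "candyz".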


- query is "yello". 53 document hits. No suggestions. "yellow" yields 36560 documents. Does the spellchecker only run when there are no document
  hits?

No, it should run in both cases. Can you reproduce in a small test case?
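
For a small test case, one option is to issue the same query with several spellcheck.count values and compare the responses side by side. The sketch below only builds the request URLs. The host, port and handler path are assumptions (adjust them to wherever the spellcheck component is attached), and `spellcheck_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url: str, query: str, count: int) -> str:
    """Build a request URL that asks for `count` spelling suggestions."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.count": count,
    }
    return base_url + "?" + urlencode(params)


# ASSUMPTION: a local Solr instance with spellcheck on the /select handler.
BASE = "http://localhost:8983/solr/select"

for count in (1, 5, 10):
    print(spellcheck_url(BASE, "yello", count))
```

Fetching each URL and diffing the suggestion lists should show whether the top suggestion changes with the count, and whether any suggestions come back at all when the query itself has document hits.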



Btw, is there a better place to be posting comments/questions like this?

Possibly, but this is the place to start. They may be Lucene SpellChecker issues, but let's diagnose here first, and then move there if needed.




Jason

On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:

I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

  1. Rebuild breaks requests - I'm using buildOnCommit ATM.  If a
commit is in progress and files are being rebuilt in the spellcheck data dir, spellcheck requests yield bogus answers. That is, I can issue identical requests and get drastically different answers. The first time, I get suggestions and "correctlySpelled" is false. The second time (during the commit), I get no suggestions and "correctlySpelled" is true. Shouldn't spellcheck use the old index until the new one is ready for use, like Solr
  does with optimizes?
  2. Inconsistent ordering - The first suggestion changes depending on the spellcheck.count that I specify. If my query is "chanl" and I ask for one result, the suggestion is "chant" (freq. 16). If I ask for 5 results, the first suggestion is also "chant"; the other 4 suggestions are less frequent (e.g. "chang", freq. 11). However, if I ask for 10 results, the first suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and "chang"; #9 is "chan" (freq. 174). Shouldn't spellcheck return the best suggestion first? In my case, shouldn't "chanel" always top "chant" and "chang" since they all have the same edit distance yet "chanel" is two
  orders of magnitude more popular?

Is there anything I could be doing wrong to create these problems? If not,
are these known issues?  If not, should I create JIRA issues for them?

Thanks,

Jason




--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







