I've been using spellcheck.count=10 since that seems to yield a much better top result than using the default count of 1. However, I'm still seeing weird cases. Here are a few queries with returned suggestions. Frequency counts are in parentheses.
- Query is "candyz". Suggestions are: 1. "candyâ" (1), 2. "candy" (965), ... #2 is vastly more popular than #1 and involves the same number of edits. Why would it order suggestions this way?
- Query is "yellw". Suggestions are: 1. "yellow" (2880), 2. "yello" (2), 3. "yelow" (1), 4. "yell" (74), ... Shouldn't "yell" come before "yello" and "yelow" due to the higher frequency?
- Query is "yello". 53 document hits. No suggestions. "yellow" yields 36560 documents. Does the spellchecker only run when there are no document hits?

Btw, is there a better place to be posting comments/questions like this?

Jason

On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:

> I've noticed a few issues with spellcheck as I've been testing it out for
> use on our site...
>
> 1. Rebuild breaks requests - I'm using rebuildOnCommit ATM. If a
>    commit is going on and files are being rebuilt in the spellcheck data
>    dir, spellcheck requests yield bogus answers. I.e. I can issue
>    identical requests and get drastically different answers. The first
>    time, I get suggestions and "correctlySpelled" is false. The second
>    time (during the commit), I get no suggestions and "correctlySpelled"
>    is true. Shouldn't spellcheck use the old index until the new one is
>    ready for use, like Solr does with optimizes?
> 2. Inconsistent ordering - The first suggestion changes depending on
>    the spellcheck.count that I specify. If my query is "chanl" and I ask
>    for one result, the suggestion is "chant" (freq. 16). If I ask for 5
>    results, the first suggestion is also "chant"; the other 4 suggestions
>    are less frequent (e.g. "chang", freq. 11). However, if I ask for 10
>    results, the first suggestion is "chanel" (freq. 1296); #2 and #3 are
>    "chant" and "chang"; #9 is "chan" (freq. 174). Shouldn't spellcheck
>    return the best suggestion first? In my case, shouldn't "chanel"
>    always top "chant" and "chang" since they all have the same edit
>    distance yet "chanel" is two orders of magnitude more popular?
>
> Is there anything I could be doing wrong to create these problems? If not,
> are these known issues? If not, should I create JIRA issues for them?
>
> Thanks,
>
> Jason

--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
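
The ordering the questions above expect (fewest edits first, ties broken by higher frequency) can be sketched in a few lines of Python. This only illustrates the expected behavior under that assumption; it is not how Solr or Lucene actually rank suggestions, and the names `edit_distance` and `rank_suggestions` are hypothetical helpers made up for this example. The words and frequencies are the ones reported in the messages above.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_suggestions(query, candidates):
    """Sort (word, frequency) pairs by edit distance, then by frequency descending.

    This is the ordering assumed in the questions above, not Solr's real logic.
    """
    return sorted(candidates,
                  key=lambda c: (edit_distance(query, c[0]), -c[1]))


# Candidates and frequencies taken from the "chanl" and "yellw" reports.
chanl = [("chant", 16), ("chang", 11), ("chanel", 1296), ("chan", 174)]
yellw = [("yellow", 2880), ("yello", 2), ("yelow", 1), ("yell", 74)]

for query, candidates in (("chanl", chanl), ("yellw", yellw)):
    print(query, [word for word, _ in rank_suggestions(query, candidates)])
```

All candidates in both lists are one edit away from the query, so under this assumed ordering frequency alone decides the result: "chanel" would lead for "chanl", and "yell" would come ahead of "yello" and "yelow" for "yellw".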