I've been using spellcheck.count=10 since that seems to yield a much better
top result than using the default count of 1.  However, I'm still seeing
weird cases.  Here are a few queries with returned suggestions.  Frequency
counts are in parentheses.

   - query is "candyz".  Suggestions are: 1. "candyâ" (1), 2. "candy" (965),
   ...  #2 is vastly more popular than #1 and involves the same # of edits.
   Why would it order suggestions this way?
   - query is "yellw".  Suggestions are: 1. "yellow" (2880), 2. "yello" (2),
   3. "yelow" (1), 4. "yell" (74), ...  Shouldn't "yell" come before "yello"
   and "yelow" due to the higher frequency?
   - query is "yello".  53 document hits.  No suggestions.  "yellow" yields
   36560 document hits.  Does the spellchecker only run when there are no
   document hits?
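
To make the ordering I'd expect concrete, here is a minimal Python sketch.
It only illustrates the expectation and is not how Solr's spellchecker
actually ranks suggestions: it assumes plain Levenshtein distance as the
primary sort key and term frequency as the tie-breaker. The names
edit_distance and rank are hypothetical helpers made up for this example;
the words and frequencies are the ones from the "yellw" query above.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank(query, candidates):
    """Order (word, frequency) pairs by fewest edits, then highest frequency.

    Assumption: this is the ordering expected in the post, not Solr's own.
    """
    return sorted(candidates,
                  key=lambda wf: (edit_distance(query, wf[0]), -wf[1]))


# Suggestions and frequencies reported for the query "yellw".
yellw = [("yellow", 2880), ("yello", 2), ("yelow", 1), ("yell", 74)]
print([word for word, _ in rank("yellw", yellw)])
# -> ['yellow', 'yell', 'yello', 'yelow']
```

All four candidates are one edit away from "yellw", so under this
assumption frequency alone decides the order and "yell" lands second.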

Btw, is there a better place to be posting comments/questions like this?

Jason

On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:

> I've noticed a few issues with spellcheck as I've been testing it out for
> use on our site...
>
>    1. Rebuild breaks requests - I'm using rebuildOnCommit ATM.  If a
>    commit is going on and files are being rebuilt in the spellcheck data dir,
>    spellcheck requests yield bogus answers.  I.e. I can issue identical
>    requests and get drastically different answers.  The first time, I get
>    suggestions and "correctlySpelled" is false.  The second time (during the
>    commit), I get no suggestions and "correctlySpelled" is true.  Shouldn't
>    spellcheck use the old index until the new one is ready for use, like Solr
>    does with optimizes?
>    2. Inconsistent ordering - The first suggestion changes depending on
>    the spellcheck.count that I specify.  If my query is "chanl" and I ask for
>    one result, the suggestion is "chant" (freq. 16).  If I ask for 5 results,
>    the first suggestion is also "chant"; the other 4 suggestions are less
>    frequent (e.g. "chang", freq. 11).  However, if I ask for 10 results, the
>    first suggestion is "chanel" (freq. 1296); #2 and #3 are "chant" and
>    "chang"; #9 is "chan" (freq. 174).  Shouldn't spellcheck return the best
>    suggestion first?  In my case, shouldn't "chanel" always top "chant" and
>    "chang" since they all have the same edit distance yet "chanel" is two
>    orders of magnitude more popular?
>
> Is there anything I could be doing wrong to create these problems?  If not,
> are these known issues?  If not, should I create JIRA issues for them?
>
> Thanks,
>
> Jason
>
>


-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
