Re: spellcheck: issues

Grant Ingersoll Tue, 07 Oct 2008 08:59:43 -0700


On Oct 6, 2008, at 6:10 PM, Jason Rennie wrote:

I've been using spellcheck.count=10 since that seems to yield a much
better top result than using the default count of 1. However, I'm still
seeing weird cases. Here are a few queries with returned suggestions.
Frequency counts are in parentheses.
- query is "candyz". Suggestions are: 1. "candyâ" (1), 2. "candy" (965),
  ... #2 is vastly more popular than #1 and involves the same # of edits.
  Why would it order suggestions this way?


I'm guessing the edit distance is less????

- query is "yellw". Suggestions are: 1. "yellow" (2880), 2. "yello" (2),
  3. "yelow" (1), 4. "yell" (74), ... Shouldn't "yell" come before "yello"
  and "yelow" due to the higher frequency?

Again, probably b/c of the distance. What distance measure are you using?
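
One way to check whether distance explains the ordering is to compute the raw edit distance by hand. Below is a minimal sketch in Python, assuming plain Levenshtein distance (unit cost for insert, delete, and substitute). The spellchecker may be configured with a different or length-normalized measure, so treat this as a diagnostic aid, not as a description of what Lucene actually computes.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for word in ["yellow", "yello", "yelow", "yell"]:
        print(word, levenshtein("yellw", word))
```

For these four candidates the raw distance is 1 in every case, so plain Levenshtein alone would leave them tied. A length-normalized similarity, or a different measure altogether, could still order them differently.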

- query is "yello". 53 document hits. No suggestions. "yellow" yields
  36560 documents. Does the spellchecker only run when there are no
  document hits?

No, it should run in both cases. Can you reproduce in a small test case?
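
For a small reproduction, it may help to issue the same query with different counts and compare the responses side by side. The sketch below only builds the request URLs. The host, port, and handler path are assumptions and need to match wherever the spellcheck component is registered in your solrconfig.xml. `spellcheck`, `spellcheck.count`, and `spellcheck.extendedResults` are the component's request parameters, and `spellcheck_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed location of a request handler that has the spellcheck component
# attached; adjust to match your own setup.
BASE_URL = "http://localhost:8983/solr/select"


def spellcheck_url(query: str, count: int, base_url: str = BASE_URL) -> str:
    """Build a spellcheck request URL for the given query and count."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.count": count,
        # Ask for term frequencies alongside the suggestions.
        "spellcheck.extendedResults": "true",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    for count in (1, 5, 10):
        print(spellcheck_url("yello", count))
```

Fetching each URL against a tiny index containing only the handful of terms in question should show whether the behavior depends on the count or on the index contents.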

Btw, is there a better place to be posting comments/questions like this?

Possibly, but here's the place to start. They may be Lucene SC issues, but let's diagnose here, first, and then move to there if needed.

Jason
On Mon, Oct 6, 2008 at 4:08 PM, Jason Rennie <[EMAIL PROTECTED]> wrote:
I've noticed a few issues with spellcheck as I've been testing it out for
use on our site...

  1. Rebuild breaks requests - I'm using rebuildOnCommit ATM. If a
     commit is going on and files are being rebuilt in the spellcheck
     data dir, spellcheck requests yield bogus answers. I.e. I can issue
     identical requests and get drastically different answers. The first
     time, I get suggestions and "correctlySpelled" is false. The second
     time (during the commit), I get no suggestions and "correctlySpelled"
     is true. Shouldn't spellcheck use the old index until the new one is
     ready for use, like solr does with optimizes?
  2. Inconsistent ordering - The first suggestion changes depending on
     the spellcheck.count that I specify. If my query is "chanl" and I ask
     for one result, the suggestion is "chant" (freq. 16). If I ask for 5
     results, the first suggestion is also "chant"; the other 4
     suggestions are less frequent (e.g. "chang", freq. 11). However, if I
     ask for 10 results, the first suggestion is "chanel" (freq. 1296); #2
     and #3 are "chant" and "chang"; #9 is "chan" (freq. 174). Shouldn't
     spellcheck return the best suggestion first? In my case, shouldn't
     "chanel" always top "chant" and "chang" since they all have the same
     edit distance yet "chanel" is two orders of magnitude more popular?
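
The ordering asked for in item 2 (closest edit distance first, ties broken by higher frequency) can be sketched as follows. This illustrates the expected behavior only and is not how the spellchecker is implemented. It assumes plain Levenshtein distance, `rank_suggestions` is a hypothetical helper, and the frequencies are the ones quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank_suggestions(query: str, freq_by_word: dict) -> list:
    """Sort candidates by edit distance, then by descending frequency."""
    return sorted(
        freq_by_word,
        key=lambda word: (levenshtein(query, word), -freq_by_word[word]),
    )


if __name__ == "__main__":
    freqs = {"chant": 16, "chang": 11, "chanel": 1296, "chan": 174}
    print(rank_suggestions("chanl", freqs))
```

All four candidates are one edit away from "chanl", so under this scheme frequency alone decides the order and "chanel" comes first regardless of how many suggestions are requested.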
Is there anything I could be doing wrong to create these problems? If not,
are these known issues? If not, should I create JIRA issues for them?
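
The "use the old index until the new one is ready" behavior asked about in item 1 is usually done by building the replacement off to the side and then swapping a single reference. Here is a minimal in-memory sketch of that pattern. The class and method names are hypothetical, and it says nothing about how Solr or Lucene actually manage the spellcheck data dir.

```python
import threading


class SwappableDictionary:
    """Serve lookups from the current word set while a new one is built."""

    def __init__(self, words):
        self._current = frozenset(words)
        self._rebuild_lock = threading.Lock()  # serializes rebuilds only

    def contains(self, word: str) -> bool:
        # Readers take one snapshot of the reference, so they always see
        # either the complete old set or the complete new set.
        snapshot = self._current
        return word in snapshot

    def rebuild(self, words) -> None:
        with self._rebuild_lock:
            fresh = frozenset(words)  # build fully before publishing
            self._current = fresh     # single reference swap


if __name__ == "__main__":
    d = SwappableDictionary(["yellow", "chanel"])
    print(d.contains("yellow"))
    d.rebuild(["yellow", "chanel", "candy"])
    print(d.contains("candy"))
```

With this shape, a request that arrives mid-rebuild gets answers from the old dictionary instead of from a half-written one.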

Thanks,

Jason
--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
