I am not super familiar with the Lucene/Solr spell checking implementations, but here is my take:

By saying to only allow more popular, you are restricting suggestions to only those terms that occur more frequently in the index. The score is still by edit distance, but only terms with a higher frequency than the term passed in will be suggested. I agree this is odd - it means you should only pass in words that you know are misspelled. You can't count on the spell checker to work that out for you, as it does when the more popular setting is off.

So that is leaving you with a nasty suggestion. But it looks like the edit distance for that suggestion is larger. What you might try is adjusting the threshold (the min edit distance) to be a bit higher. That may exclude that suggestion. It's not a great solution though. It's likely to suggest something else :) Ideally, the spell checker should probably be better at not suggesting anything when you have chosen a good word. It doesn't care that you have a good word already - it sees another word with greater frequency and within the edit distance allowed.

If you don't set the more popular setting, then upon finding a word in the index, the spell checker returns the word passed in. With the more popular setting on, you get the results you see - it still suggests, but it specifically will not suggest the word you passed in itself (the comment says, 'that would be silly'). So you will likely see bad suggestions for correct words with this setting.
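
To make the above concrete, here is a rough Python sketch of the behavior as I understand it. It is not the Lucene code: the names (suggest, min_similarity, only_more_popular), the toy term frequencies, and the similarity formula (1 - edit distance / longer length) are all my own assumptions for illustration, and the real spell checker will produce different scores.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed scoring: 1.0 means identical, values near 0.0 mean very different."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def suggest(word, freqs, min_similarity=0.5, only_more_popular=False,
            max_suggestions=1):
    """Hypothetical model of the suggestion logic described above.

    freqs maps each indexed term to its frequency in the index.
    """
    word_freq = freqs.get(word, 0)

    # Setting off: a word that is already in the index is returned as-is.
    if not only_more_popular and word_freq > 0:
        return [word]

    candidates = []
    for term, freq in freqs.items():
        if term == word:
            continue  # never suggest the word that was passed in
        score = similarity(word, term)
        if score < min_similarity:
            continue  # too far away from the input
        if only_more_popular and freq <= word_freq:
            continue  # setting on: keep only more frequent terms
        candidates.append((score, freq, term))

    # Best similarity first; frequency only breaks ties.
    candidates.sort(reverse=True)
    return [term for _, _, term in candidates[:max_suggestions]]


# Toy index frequencies, invented for this example.
freqs = {"calvin": 30, "klein": 40, "cin2": 120}

print(suggest("klien", freqs))  # misspelling, setting off
print(suggest("klein", freqs))  # correct word, setting off
print(suggest("klein", freqs, min_similarity=0.1, only_more_popular=True))
print(suggest("klein", freqs, min_similarity=0.5, only_more_popular=True))
```

With this toy data the misspelling gets "klein", and the correct word is returned unchanged when the setting is off. With the setting on and a loose threshold, the correct word "klein" gets "cin2", because that is the only term left that is more frequent. Tightening the threshold drops that suggestion, but in a real index some other, closer and more frequent term could take its place.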

- Mark

Nicholas Piasecki wrote:
Hello All,

I'm new to Solr, so forgive me if I'm overlooking something obvious. My
observation is that the spellcheck.onlyMorePopular property of the
SpellCheckComponent seems to not do what I expect.

If I send the query "calvin klien" to my data store, then the spell
checker correctly suggests "klein" for "klien," and running the new
"calvin klein" query returns the expected many product results.

However, when sending the correct query of "calvin klein," the spell
checker will suggest "cin2" (another brand name in our data store) for
"klein," and running that new "calvin cin2" collated query obviously
returns zero results.

It would seem to me that the "onlyMorePopular" property, when set to
true, only performs its calculation of popularity on the particular
misspelled word alone, and not the query as a whole. Since there are
indeed more C-IN2 brand products in our database, it returns "cin2" as
a spelling correction for "klein," seeing that the "cin2" token alone
returns many results but not bothering to check that "calvin cin2"
returns none.
A less astonishing behavior would be for it to suggest "cin2", test to
see how many hits "calvin cin2" returns, see that it returns less than
"calvin klein", and then exclude that suggestion because it is not more
popular in the context of the original query.

So:

1 - Is my analysis correct? Is this really how it works?

2 - Is there a configuration setting that I can do to make the spell
checker use the desired behavior? Or, should I just immediately submit a
request with its correlated suggestion with zero rows and do a
comparison on the results, effectively performing the "onlyMorePopular"
calculation myself?

Many thanks; so far, Solr is proving to be an excellent product!

V/R,
Nicholas Piasecki

Software Developer
Skiviez, Inc.
1-800-628-1693 x6003
n...@skiviez.com


