Shalin Shekhar Mangar wrote:
The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (the spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index, and collect the top (by Lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr
index and remove terms which have a lower frequency.
5. Compute the edit distance between each result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance,
descending) whose similarity is greater than the specified accuracy.

Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :)
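Just to check my understanding of the configuration side: a minimal solrconfig.xml sketch along those lines might look as follows (the dictionary name, index directory, accuracy value and the "spell" field are only placeholders of mine, not taken from your mail):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <!-- dictionary name, selectable via spellcheck.dictionary -->
      <str name="name">default</str>
      <!-- spellchecker that reads its tokens from a field in the Solr index (step 1) -->
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <!-- field in the Solr index whose tokens feed the dictionary (step 1) -->
      <str name="field">spell</str>
      <!-- separate Lucene index holding the n-grams (step 2) -->
      <str name="spellcheckIndexDir">./spellchecker</str>
      <!-- minimum similarity for a suggestion to be returned (step 6) -->
      <float name="accuracy">0.7</float>
    </lst>
  </searchComponent>

A request would then pass something like spellcheck=true&spellcheck.count=10&spellcheck.onlyMorePopular=true to trigger steps 3 and 4.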


Your primary use-case is not spellcheck at all, but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.

This is certainly true.

The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out either the exact or the fuzzy hits, but this would make paging impossible.
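For example (field and term invented by me), a query like

  q=name:reciever~0.7

uses the Lucene fuzzy syntax with a minimum similarity of 0.7, and as far as I can tell the exact hits for "reciever" and the fuzzy hits for terms like "receiver" come back interleaved in one ranked list, which is exactly what makes filtering and paging awkward.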

The approach using KeywordTokenizer that you suggested before seems more promising to me. Unfortunately, there seems to be no documentation for this (at least in conjunction with spell checking). If I understand correctly, the tokenizer must be applied to the field in the search index (not to the spell checking index). Is that correct?
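To make the question concrete, this is roughly what I have in mind in schema.xml (the field and type names are my own invention):

  <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- keep the whole field value as a single token instead of splitting it into words -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="spell" type="textSpell" indexed="true" stored="false"/>
  <copyField source="name" dest="spell"/>

The spellcheck component would then point at the "spell" field (<str name="field">spell</str>), so the dictionary, and hence the suggestions, would consist of whole untokenized values rather than individual words.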

Thanks,
Marcus
