Shalin Shekhar Mangar wrote:
The implementation is a bit more complicated.

1. Read all tokens from the specified field in the Solr index.
2. Create n-grams of the terms read in #1 and index them into a separate
Lucene index (the spellcheck index).
3. When asked for suggestions, create n-grams of the query terms, search the
spellcheck index, and collect the top (by Lucene score) 10*spellcheck.count
results.
4. If onlyMorePopular=true, determine the frequency of each result in the Solr
index and remove terms which have a lower frequency.
5. Compute the edit distance between each result and the query token.
6. Return the top spellcheck.count results (sorted by edit distance,
descending) whose similarity is greater than the specified accuracy.

Thanks, I think this makes things clear(er) now. I do agree that the documentation needs improvement on this point, as you said later in this thread. :)
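Just to check my understanding of the configuration side: a minimal solrconfig.xml sketch along those lines might look as follows (the dictionary name, index directory, accuracy value and the "spell" field are only placeholders of mine, not taken from your mail):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <!-- dictionary name, selectable via spellcheck.dictionary -->
      <str name="name">default</str>
      <!-- spellchecker that reads its tokens from a field in the Solr index (step 1) -->
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <!-- field in the Solr index whose tokens feed the dictionary (step 1) -->
      <str name="field">spell</str>
      <!-- separate Lucene index holding the n-grams (step 2) -->
      <str name="spellcheckIndexDir">./spellchecker</str>
      <!-- minimum similarity for a suggestion to be returned (step 6) -->
      <float name="accuracy">0.7</float>
    </lst>
  </searchComponent>

A request would then pass something like spellcheck=true&spellcheck.count=10&spellcheck.onlyMorePopular=true to trigger steps 3 and 4.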


Your primary use-case is not spellcheck at all, but this might work with some
hacking. Fuzzy queries may be a better solution, as Walter said. Storing all
successful search queries may be hard to scale.

This is certainly true.

The drawback of fuzzy searching is that you get back exact and fuzzy hits together in one result set (correct me if I'm wrong). One could filter out either the exact or the fuzzy hits, but this would make paging impossible.
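For example (field and term invented by me), a query like

  q=name:reciever~0.7

uses the Lucene fuzzy syntax with a minimum similarity of 0.7, and as far as I can tell the exact hits for "reciever" and the fuzzy hits for terms like "receiver" come back interleaved in one ranked list, which is exactly what makes filtering and paging awkward.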

The approach using KeywordTokenizer that you suggested before seems more promising to me. Unfortunately, there seems to be no documentation for this (at least in conjunction with spell checking). If I understand correctly, the tokenizer must be applied to the field in the search index (not to the spell checking index). Is that correct?
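To make the question concrete, this is roughly what I have in mind in schema.xml (the field and type names are my own invention):

  <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- keep the whole field value as a single token instead of splitting it into words -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="spell" type="textSpell" indexed="true" stored="false"/>
  <copyField source="name" dest="spell"/>

The spellcheck component would then point at the "spell" field (<str name="field">spell</str>), so the dictionary, and hence the suggestions, would consist of whole untokenized values rather than individual words.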

Thanks,
Marcus
