How to tokenize/analyze docs for the spellchecker - at indexing and query time

Martin Grotzke Wed, 01 Oct 2008 04:12:26 -0700

Hi,

I'm just starting with the spellchecker component provided by Solr - it
is really cool!


Now I'm thinking about the source field of the spellchecker ("spell"):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured?

If I have brands such as "Apple" or "Ed Hardy", I would copy them (the
field "brand") directly to the "spell" field. The "spell" field is of
type "string".

Other fields, e.g. the product title, I would first copy to some
whitespace-tokenized field (a field type with WhitespaceTokenizerFactory)
and afterwards to the "spell" field. The product title might be, for
example, "Canon EOS 450D EF-S 18-55 mm".

This is the process I have in mind during indexing (though I'm not sure
whether some tokens/terms should be removed; I'd assume that any term
might be misspelled by the user).
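
To make this concrete, the indexing side could look roughly like the
following schema.xml fragment. This is only a sketch: "brand" and "spell"
are the fields mentioned above, while "title", "title_ws" and the type
name "textWs" are made-up names for illustration, and "string" and "text"
are assumed to be field types that already exist in the schema.

```xml
<!-- schema.xml fragment (sketch, not a complete schema) -->

<!-- hypothetical field type that only splits on whitespace -->
<fieldType name="textWs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- "brand" and "spell" as described above; "title" and "title_ws" are assumed names -->
<field name="brand"    type="string" indexed="true" stored="true"/>
<field name="title"    type="text"   indexed="true" stored="true"/>
<field name="title_ws" type="textWs" indexed="true" stored="false"/>
<field name="spell"    type="string" indexed="true" stored="false" multiValued="true"/>

<!-- brands go into "spell" untouched, as one term per value -->
<copyField source="brand" dest="spell"/>

<!-- titles are split on whitespace in "title_ws" -->
<copyField source="title" dest="title_ws"/>

<!-- The second hop (title_ws to spell) is left out on purpose: as far as I
     understand, copyField hands over the original field value and not the
     analyzed tokens, so the analysis that counts would be the one
     configured on the destination field. -->
```

With such a setup "Ed Hardy" would end up in "spell" as the single term
"Ed Hardy", whereas the example title would be split into Canon, EOS,
450D, EF-S, 18-55 and mm in "title_ws".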

Now when it comes to searching, the query should be analyzed using the
queryAnalyzerFieldType definition, which has a StandardTokenizerFactory
in the schema.xml of the Solr example.

Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?

Or should I use a StandardTokenizerFactory for the "spell" field, so
that fields copied into this field get tokenized/analyzed in the same
way as the query will get tokenized/analyzed?
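
This last variant could be sketched as follows, assuming one shared field
type (called "textSpell" here) that is used both for the "spell" field and
as queryAnalyzerFieldType. The source field name "title", the spellchecker
name "default" and the index directory are assumptions for illustration,
not taken from an actual setup.

```xml
<!-- schema.xml fragment (sketch): one type for the spellcheck source field -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- swap in solr.WhitespaceTokenizerFactory to try the other variant -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<!-- every source field is analyzed by the type of "spell" on arrival -->
<copyField source="brand" dest="spell"/>
<copyField source="title" dest="spell"/>

<!-- solrconfig.xml fragment (sketch): the query goes through the same type -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```

The point of this layout is that indexed terms and query terms come out of
the same analyzer, so changing the tokenizer in one place changes both
sides at once.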

Do you have any experience with this and/or recommendations regarding
this?

Are there other things to consider?

Thanx for your help,
cheers,
Martin
