Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Jason Rennie Fri, 03 Oct 2008 13:21:43 -0700

Hi Martin,

I'm a relative newbie to Solr, but I have been playing with the spellcheck
component and seem to have it working.  I can't explain everything that is
going on, but with any luck I can help you get the spellchecker up and
running.  Additional replies are inline below.

On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]> wrote:

> Now I'm thinking about the source-field in the spellchecker ("spell"):
> how should fields be analyzed during indexing, and how should the
> queryAnalyzerFieldType be configured.

I followed the conventions in the default solrconfig.xml and schema.xml
files.  So I created a "textSpell" field type (schema.xml):

    <!-- field type for the spell checker which doesn't stem -->
    <fieldtype name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

and used this for the queryAnalyzerFieldType.  I also created a spellField
to store the text I want to spell check against and used the same analyzer
(figuring that the query and indexed data should be analyzed the same way)
(schema.xml):

   <!-- Spell check field -->
   <field name="spellField" type="textSpell" indexed="true" stored="true" />
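
For reference, a minimal sketch of how the solrconfig.xml side might tie
these together.  The component name, the "default" spellchecker name, the
index directory and the buildOnCommit setting are assumptions borrowed from
the example configuration, not something confirmed in this thread:

```xml
<!-- solrconfig.xml: hypothetical spellcheck component wiring -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- analyze the incoming query with the same non-stemming type -->
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- build the dictionary from the field defined in schema.xml -->
    <str name="field">spellField</str>
    <!-- assumed location of the spelling index -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- assumed: rebuild the dictionary after each commit -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

The component still has to be listed in a request handler before it answers
queries.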

> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
> field "brand") directly to the "spell" field. The "spell" field is of
> type "string".

We're copying description to spellField.  I'd recommend using a type like
the above textSpell type since "The StringField type is not analyzed, but
indexed/stored verbatim" (schema.xml):

  <copyField source="description" dest="spellField" />

> Other fields like e.g. the product title I would first copy to some
> whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
> and afterwards to the "spell" field. The product title might be e.g.
> "Canon EOS 450D EF-S 18-55 mm".

Hmm... I'm not sure if this would work as I don't think the analyzer is
applied until after the copy is made.  FWIW, I've had trouble copying
multiple fields to spellField (i.e. adding a second copyField w/
dest="spellField"), so we just index the spellchecker on a single field...
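
One possible cause of the multiple-copyField trouble, offered as an
assumption rather than a diagnosis: a field that receives values from more
than one copyField generally needs to be declared multiValued.  A sketch
using the "description" and "brand" fields mentioned in this thread:

```xml
<!-- schema.xml: hypothetical multi-source variant of the spell field -->
<!-- multiValued="true" lets the field accept one value per copyField -->
<field name="spellField" type="textSpell" indexed="true" stored="true"
       multiValued="true" />

<!-- each copyField passes the raw source value; the textSpell analyzer
     is applied when spellField itself is indexed -->
<copyField source="description" dest="spellField" />
<copyField source="brand" dest="spellField" />
```

Both sources are copied straight into spellField here, rather than being
routed through an intermediate field first.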

> Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
> StandardTokenizerFactory here?

I think if you use the same analyzer for indexing and queries, the
distinction probably isn't tremendously important.  When I went searching,
it looked like the StandardTokenizer split on non-letters.  I'd guess the
rationale for using the StandardTokenizer is that it won't recommend
non-letter characters.  I was seeing some weirdness earlier (no
inserts/deletes), but that disappeared now that I'm using the
StandardTokenizer.
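
If you want to compare the two tokenizers yourself, one way is to define a
second field type that differs only in the tokenizer.  This is a sketch; the
name "textSpellWs" is made up for illustration:

```xml
<!-- schema.xml: hypothetical whitespace-tokenized variant of textSpell -->
<fieldtype name="textSpellWs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits on whitespace only, so punctuation stays inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

Running a title such as "Canon EOS 450D EF-S 18-55 mm" through both types on
the admin analysis page should show how each one splits it.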

Cheers,

Jason
