This is why OR is a better choice. With AND, one miss means no results
at all. Spelling suggestions will never be good enough to make AND work.
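One way to get that behavior in Solr is to set the default operator in schema.xml. A one-line sketch (adjust to your own schema):

```xml
<!-- schema.xml: multi-word queries match with OR semantics, so one
     bad spelling suggestion doesn't zero out the whole result set -->
<solrQueryParser defaultOperator="OR"/>
```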
wunder
On 10/6/08 12:51 AM, "Martin Grotzke" <[EMAIL PROTECTED]> wrote:
> Hi Jason,
>
> what about multi-word searches like "harry potter"? When I do a search
> in our index for "harry poter", I get the suggestion "harry
> spotter" (using spellcheck.collate=true and jarowinkler distance).
> Searching for "harry spotter" (we're searching AND, not OR) then gives
> no results. I assume this is because suggestions are made for each word
> separately, which does not require that all of the suggested words occur
> together in the same document.
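> For reference, this is the kind of request I'm sending (host, port, and
> handler are of course specific to our setup):

```
http://localhost:8983/solr/select?q=harry+poter&spellcheck=true&spellcheck.collate=true
```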
>
> I wonder what's the standard approach for searches with multiple words.
> Are these working ok for you?
>
> Cheers,
> Martin
>
> On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
>> Hi Martin,
>>
>> I'm a relative newbie to Solr; I've been playing with the spellcheck
>> component and seem to have it working. I certainly can't explain everything
>> that's going on, but with any luck I can help you get the spellchecker
>> up and running. Additional replies are in-lined below.
>>
>> On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]
>>> wrote:
>>
>>> Now I'm thinking about the source-field in the spellchecker ("spell"):
>>> how should fields be analyzed during indexing, and how should the
>>> queryAnalyzerFieldType be configured.
>>
>>
>> I followed the conventions in the default solrconfig.xml and schema.xml
>> files. So I created a "textSpell" field type (schema.xml):
>>
>> <!-- field type for the spell checker which doesn't stem -->
>> <fieldtype name="textSpell" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldtype>
>>
>> and used this for the queryAnalyzerFieldType. I also created a spellField
>> to store the text I want to spell check against, with the same analyzer
>> (figuring that the query and the indexed data should be analyzed the same
>> way) (schema.xml):
>>
>> <!-- Spell check field -->
>> <field name="spellField" type="textSpell" indexed="true" stored="true" />
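>> For completeness, the component wiring in solrconfig.xml looks roughly like
>> the stock example. A sketch only (the spellchecker name and index directory
>> below are illustrative, not necessarily what you'd use):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- build the spelling dictionary from the copyField target -->
    <str name="field">spellField</str>
    <!-- illustrative location for the spellcheck index on disk -->
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```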
>>
>>
>>
>>> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
>>> field "brand") directly to the "spell" field. The "spell" field is of
>>> type "string".
>>
>>
>> We're copying description to spellField. I'd recommend using a type like
>> the above textSpell type since "The StringField type is not analyzed, but
>> indexed/stored verbatim" (schema.xml):
>>
>> <copyField source="description" dest="spellField" />
>>
>>> Other fields, like e.g. the product title, I would first copy to some
>>> whitespace-tokenized field (a field type with WhitespaceTokenizerFactory)
>>> and afterwards to the "spell" field. The product title might be e.g.
>>> "Canon EOS 450D EF-S 18-55 mm".
>>
>>
>> Hmm... I'm not sure this would work, as I don't think the analyzer is
>> applied until after the copy is made. FWIW, I've had trouble copying
>> multiple fields to spellField (i.e. adding a second copyField with
>> dest="spellField"), so we just index the spellchecker on a single field...
>>
>>> Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
>>> StandardTokenizerFactory here?
>>
>>
>> I think if you use the same analyzer for indexing and queries, the
>> distinction probably isn't tremendously important. When I went looking,
>> it appeared that the StandardTokenizer splits on non-letters. I'd guess the
>> rationale for using the StandardTokenizer is that it won't recommend
>> non-letter characters. I was seeing some weirdness earlier (no
>> inserts/deletes), but that disappeared once I switched to the
>> StandardTokenizer.
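>> If you ever want to compare, swapping in whitespace tokenization is a
>> one-line change to the analyzer. Sketch only ("textSpellWs" is just a
>> made-up name for illustration):

```xml
<!-- splits only on whitespace, so tokens like "EF-S" and "18-55" survive intact -->
<fieldtype name="textSpellWs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```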
>>
>> Cheers,
>>
>> Jason