Thanx for your help so far, I just wanted to post my results here... In short: I now use the ShingleFilter to create shingles when copying my fields into my field "spellMultiWords". At query time, I use a MultiWordSpellingQueryConverter that just leaves the query as is, so that there's only one token that is checked for spelling suggestions.
Here's the detailed configuration:
= schema.xml =
<fieldType name="textSpellMultiWords" class="solr.TextField"
positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3"
outputUnigrams="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<field name="spellMultiWords" type="textSpellMultiWords" indexed="true"
stored="true" multiValued="true"/>
<copyField source="name" dest="spellMultiWords" />
<copyField source="cat" dest="spellMultiWords" />
... and more ...
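Just to illustrate what ends up in the index (this is not part of the config): with maxShingleSize="3" and outputUnigrams="true", a title like "Canon EOS 450D" is indexed (after lowercasing) as the tokens

  canon
  canon eos
  canon eos 450d
  eos
  eos 450d
  450d

so multi-word queries can be checked against whole shingles instead of single words only.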
= solrconfig.xml =
<searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">

  <!-- this is not used at all, can probably be omitted -->
  <str name="queryAnalyzerFieldType">textSpellMultiWords</str>

  <lst name="spellchecker">
    <!-- the name is optional; it is required only when more than one
         spellchecker is configured -->
    <str name="name">default</str>
    <str name="field">spellMultiWords</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
    <str name="accuracy">0.5</str>
    <str name="buildOnCommit">true</str>
  </lst>

  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spellMultiWords</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
    <str name="buildOnCommit">true</str>
  </lst>

</searchComponent>
<queryConverter name="queryConverter"
class="my.proj.solr.MultiWordSpellingQueryConverter"/>
= MultiWordSpellingQueryConverter =
package my.proj.solr;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * In contrast to the default SpellingQueryConverter, the query is NOT
     * split into words but kept as a single token, so that it is checked
     * against the indexed shingles as a whole.
     *
     * @param original the original query string
     * @return a Collection containing a single Token
     */
    @Override
    public Collection<Token> convert( final String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        // one token spanning the whole query string
        final Token token = new Token( 0, original.length() );
        token.setTermBuffer( original );
        return Arrays.asList( token );
    }
}
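A quick standalone sanity check (assuming Lucene 2.4 / Solr 1.3 on the classpath) that the converter really keeps the query as a single token:

MultiWordSpellingQueryConverter converter = new MultiWordSpellingQueryConverter();
Collection<Token> tokens = converter.convert( "harry poter" );
// exactly one token, containing the query as-is
assert tokens.size() == 1;
assert "harry poter".equals( tokens.iterator().next().term() );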
There are still some issues to be resolved:
- terms are lowercased in the index, so the original case should somehow
  be restored in the suggestions
- we use stemming for our text field, so the spellchecker might suggest
  searches that lead to identical search results (e.g. the german2 stemmer
  stems both "hose" and "hosen" to "hos" -> "Hose" and "Hosen" give the
  same results)
- inconsistent/strange sorting of suggestions (as described in
  http://www.nabble.com/spellcheck%3A-issues-td19845539.html)
Cheers,
Martin
On Mon, 2008-10-06 at 22:45 +0200, Martin Grotzke wrote:
> On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote:
> > On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
> >
> > > Hi Jason,
> > >
> > > what about multi-word searches like "harry potter"? When I do a search
> > > in our index for "harry poter", I get the suggestion "harry
> > > spotter" (using spellcheck.collate=true and jarowinkler distance).
> > > Searching for "harry spotter" (we're searching AND, not OR) then gives
> > > no results. I assume that this is because suggestions are done for
> > > words separately, and this does not require that both/all suggestions
> > > are contained in the same document.
> > >
> >
> > Yeah, the SpellCheckComponent is not phrase aware. My guess would be
> > that you would somehow need a QueryConverter (see
> > http://wiki.apache.org/solr/SpellCheckComponent)
> > that preserved phrases as a single token. Likewise, you would need
> > that on your indexing side as well for the spell checker. In short, I
> > suppose it's possible, but it would be work. You probably could use
> > the shingle filter (token based n-grams).
> I also thought about something like this, and also stumbled over the
> ShingleFilter :)
>
> So I would change the "spell" field to use the ShingleFilter?
>
> Did I understand the answer to the posting "chaining copyFields"
> correctly, that I cannot pipe the title through some "shingledTitle"
> field and copy it afterwards to the "spell" field (while other fields
> like brand are copied directly to the spell field)?
>
> Thanx && cheers,
> Martin
>
>
> >
> > Alternatively, by using extendedResults, you can get back the
> > frequency of each of the words, and then you could decide whether the
> > collation is going to have any results assuming they are all or'd
> > together. For phrases and AND queries, I'm not sure. It's doable,
> > I'm sure, but it would be a lot more involved.
> >
> >
> > > I wonder what's the standard approach for searches with multiple
> > > words.
> > > Are these working ok for you?
> > >
> > > Cheers,
> > > Martin
> > >
> > > On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
> > >> Hi Martin,
> > >>
> > >> I'm a relative newbie to solr, have been playing with the spellcheck
> > >> component and seem to have it working. I certainly can't explain
> > >> what all
> > >> is going on, but with any luck, I can help you get the spellchecker
> > >> up-and-running. Additional replies in-lined below.
> > >>
> > >> On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]>
> > >> wrote:
> > >>
> > >>> Now I'm thinking about the source-field in the spellchecker
> > >>> ("spell"):
> > >>> how should fields be analyzed during indexing, and how should the
> > >>> queryAnalyzerFieldType be configured.
> > >>
> > >>
> > >> I followed the conventions in the default solrconfig.xml and
> > >> schema.xml
> > >> files. So I created a "textSpell" field type (schema.xml):
> > >>
> > >> <!-- field type for the spell checker which doesn't stem -->
> > >> <fieldtype name="textSpell" class="solr.TextField"
> > >> positionIncrementGap="100">
> > >> <analyzer>
> > >> <tokenizer class="solr.StandardTokenizerFactory"/>
> > >> <filter class="solr.LowerCaseFilterFactory"/>
> > >> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >> </analyzer>
> > >> </fieldtype>
> > >>
> > >> and used this for the queryAnalyzerFieldType. I also created a
> > >> spellField
> > >> to store the text I want to spell check against and used the same
> > >> analyzer
> > >> (figuring that the query and indexed data should be analyzed the
> > >> same way)
> > >> (schema.xml):
> > >>
> > >> <!-- Spell check field -->
> > >> <field name="spellField" type="textSpell" indexed="true"
> > >> stored="true" />
> > >>
> > >>
> > >>
> > >>> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them
> > >>> (the
> > >>> field "brand") directly to the "spell" field. The "spell" field is
> > >>> of
> > >>> type "string".
> > >>
> > >>
> > >> We're copying description to spellField. I'd recommend using a
> > >> type like
> > >> the above textSpell type since "The StringField type is not
> > >> analyzed, but
> > >> indexed/stored verbatim" (schema.xml):
> > >>
> > >> <copyField source="description" dest="spellField" />
> > >>
> > >>> Other fields like e.g. the product title I would first copy to some
> > >>> whitespaceTokenized field (field type with WhitespaceTokenizerFactory)
> > >>> and afterwards to the "spell" field. The product title might be e.g.
> > >>> "Canon EOS 450D EF-S 18-55 mm".
> > >>
> > >>
> > >> Hmm... I'm not sure if this would work as I don't think the
> > >> analyzer is
> > >> applied until after the copy is made. FWIW, I've had trouble copying
> > >> multiple fields to spellField (i.e. adding a second copyField w/
> > >> dest="spellField"), so we just index the spellchecker on a single
> > >> field...
> > >>
> > >>> Shouldn't this be a WhitespaceTokenizerFactory, or is it better to
> > >>> use a StandardTokenizerFactory here?
> > >>
> > >>
> > >> I think if you use the same analyzer for indexing and queries, the
> > >> distinction probably isn't tremendously important. When I went
> > >> searching,
> > >> it looked like the StandardTokenizer split on non-letters. I'd
> > >> guess the
> > >> rationale for using the StandardTokenizer is that it won't recommend
> > >> non-letter characters. I was seeing some weirdness earlier (no
> > >> inserts/deletes), but that disappeared now that I'm using the
> > >> StandardTokenizer.
> > >>
> > >> Cheers,
> > >>
> > >> Jason
> > > --
> > > Martin Grotzke
> > > http://www.javakaffee.de/blog/
> >
> > --------------------------
> > Grant Ingersoll
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
--
Martin Grotzke
http://www.javakaffee.de/blog/