On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
Hi Jason,
what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume that this is because suggestions are made for
each word separately, and this does not require that both/all
suggestions are contained in the same document.
Yeah, the SpellCheckComponent is not phrase aware. My guess would be
that you would somehow need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent)
that preserved phrases as a single token. Likewise, you would need
that on your indexing side as well for the spell checker. In short, I
suppose it's possible, but it would be work. You probably could use
the shingle filter (token based n-grams).
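A rough, untested sketch of what a shingle-based field type might look
like, assuming your Solr version ships a solr.ShingleFilterFactory (the
type name and maxShingleSize value here are made up for illustration):

<fieldtype name="textSpellShingle" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

With outputUnigrams="true" you'd get both single words and two-word
shingles ("harry", "potter", "harry potter") into the spelling field, so
phrase-level suggestions at least become possible.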
Alternatively, by using extendedResults, you can get back the
frequency of each of the words, and then you could decide whether the
collation is going to have any results assuming they are all or'd
together. For phrases and AND queries, I'm not sure. It's doable,
I'm sure, but it would be a lot more involved.
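For example, an extended-results request might look something like this
(the /select handler and localhost URL are just the defaults; adjust for
your setup):

http://localhost:8983/solr/select?q=harry+poter
    &spellcheck=true
    &spellcheck.extendedResults=true
    &spellcheck.collate=true

The extended results then report a frequency for each suggested word,
which is what you'd use to judge whether an OR'd collation can match
anything at all.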
I wonder what's the standard approach for searches with multiple
words.
Are these working ok for you?
Cheers,
Martin
On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
Hi Martin,
I'm a relative newbie to solr, have been playing with the spellcheck
component and seem to have it working. I certainly can't explain what
all is going on, but with any luck, I can help you get the spellchecker
up and running. Additional replies in-lined below.
On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]> wrote:
Now I'm thinking about the source-field in the spellchecker
("spell"):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured?
I followed the conventions in the default solrconfig.xml and
schema.xml
files. So I created a "textSpell" field type (schema.xml):
<!-- field type for the spell checker which doesn't stem -->
<fieldtype name="textSpell" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
and used this for the queryAnalyzerFieldType. I also created a
spellField to store the text I want to spell check against and used the
same analyzer (figuring that the query and indexed data should be
analyzed the same way) (schema.xml):

<!-- Spell check field -->
<field name="spellField" type="textSpell" indexed="true"
    stored="true" />
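For completeness, here's roughly how the two get wired together in
solrconfig.xml (this mirrors the SpellCheckComponent wiki page; the
index dir path is just an example):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellField</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>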
If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
field "brand") directly to the "spell" field. The "spell" field is of
type "string".
We're copying description to spellField. I'd recommend using a type like
the above textSpell type, since "The StringField type is not analyzed,
but indexed/stored verbatim" (schema.xml):

<copyField source="description" dest="spellField" />
Other fields like e.g. the product title I would first copy to some
whitespace-tokenized field (field type with WhitespaceTokenizerFactory)
and afterwards to the "spell" field. The product title might be e.g.
"Canon EOS 450D EF-S 18-55 mm".
Hmm... I'm not sure if this would work, as I don't think the analyzer is
applied until after the copy is made. FWIW, I've had trouble copying
multiple fields to spellField (i.e. adding a second copyField w/
dest="spellField"), so we just index the spellchecker on a single
field...
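One thing that might be worth trying, though I haven't verified it
myself: declaring spellField as multiValued so that several copyField
directives can target it, e.g.

<field name="spellField" type="textSpell" indexed="true"
    stored="true" multiValued="true"/>
<copyField source="description" dest="spellField"/>
<copyField source="title" dest="spellField"/>

(the "title" source field is hypothetical; substitute whatever holds
your product titles).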
Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?
I think if you use the same analyzer for indexing and queries, the
distinction probably isn't tremendously important. When I went
searching, it looked like the StandardTokenizer split on non-letters.
I'd guess the rationale for using the StandardTokenizer is that it won't
recommend non-letter characters. I was seeing some weirdness earlier (no
inserts/deletes), but that disappeared now that I'm using the
StandardTokenizer.
Cheers,
Jason
--
Martin Grotzke
http://www.javakaffee.de/blog/
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ