On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:

Hi Jason,

what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume this is because suggestions are made for each word
separately, and nothing requires that all of the suggested words occur
together in the same document.


Yeah, the SpellCheckComponent is not phrase aware. My guess is that you would need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent) that preserves phrases as a single token, and you would need the same treatment on the indexing side for the spell checker. In short, I suppose it's possible, but it would be work. You could probably use the shingle filter (token-based n-grams).
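To illustrate the shingle idea, something along these lines might work (untested; the type name "textSpellShingle" is hypothetical, not from this thread). ShingleFilterFactory emits token n-grams, so "harry potter" would be indexed both as the separate words and as the single token "harry potter":

```xml
<!-- Hypothetical shingle-based spell field type: outputs unigrams
     plus two-word shingles, so phrases survive as single tokens. -->
<fieldtype name="textSpellShingle" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

You would still need a matching QueryConverter so that multi-word queries reach the spell checker as phrase tokens rather than being split first.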

Alternatively, by using extendedResults you can get back the frequency of each of the words, and then decide whether the collation would have any results assuming the terms are all OR'd together. For phrases and AND queries, I'm not sure. It's doable, I'm sure, but it would be a lot more involved.
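For reference, extendedResults is just an extra request parameter alongside the ones already being used; a request along these lines (host/port are the Solr example defaults, the query string is hypothetical) returns per-suggestion frequency information you could inspect:

```
http://localhost:8983/solr/select?q=harry+poter
    &spellcheck=true
    &spellcheck.collate=true
    &spellcheck.extendedResults=true
```

With extendedResults on, each suggestion carries an index frequency, which is what would let client code estimate whether an OR of the suggested terms can match anything at all.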


I wonder what's the standard approach for searches with multiple words.
Are these working ok for you?

Cheers,
Martin

On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
Hi Martin,

I'm a relative newbie to solr, have been playing with the spellcheck
component and seem to have it working. I certainly can't explain what all
is going on, but with any luck, I can help you get the spellchecker
up-and-running.  Additional replies in-lined below.

On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]>
wrote:

Now I'm thinking about the source-field in the spellchecker ("spell"):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured.


I followed the conventions in the default solrconfig.xml and schema.xml
files.  So I created a "textSpell" field type (schema.xml):

   <!-- field type for the spell checker which doesn't stem -->
   <fieldtype name="textSpell" class="solr.TextField"
positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldtype>

and used this for the queryAnalyzerFieldType. I also created a spellField to store the text I want to spell check against, using the same analyzer (figuring that the query and the indexed data should be analyzed the same way)
(schema.xml):

  <!-- Spell check field -->
<field name="spellField" type="textSpell" indexed="true" stored="true" />
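For completeness, here is a sketch of the matching solrconfig.xml entry that ties these pieces together (names follow the default example config; "spellField" and "textSpell" are the schema definitions above, and the index directory is an assumption):

```xml
<!-- Sketch: wire the spell checker to the schema pieces above. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellField</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```

The queryAnalyzerFieldType here is what analyzes the incoming spellcheck query, so keeping it identical to the indexed field's type is what makes the two sides comparable.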



If I have brands, e.g. "Apple" or "Ed Hardy", I would copy them (the field "brand") directly to the "spell" field. The "spell" field is of
type "string".


We're copying description to spellField. I'd recommend using a type like the above textSpell type since "The StringField type is not analyzed, but
indexed/stored verbatim" (schema.xml):

 <copyField source="description" dest="spellField" />

Other fields, e.g. the product title, I would first copy to some
whitespace-tokenized field (a field type with WhitespaceTokenizerFactory)
and afterwards to the "spell" field. The product title might be e.g.
"Canon EOS 450D EF-S 18-55 mm".


Hmm... I'm not sure this would work, as I don't think the analyzer is
applied until after the copy is made.  FWIW, I've had trouble copying
multiple fields to spellField (i.e. adding a second copyField w/
dest="spellField"), so we just index the spellchecker on a single field...

Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?


I think if you use the same analyzer for indexing and queries, the
distinction probably isn't tremendously important. When I went looking,
it appeared that the StandardTokenizer splits on non-letters. I'd guess
the rationale for using the StandardTokenizer is that it won't suggest
non-letter characters.  I was seeing some weirdness earlier (no
inserts/deletes), but that disappeared once I switched to the
StandardTokenizer.

Cheers,

Jason
--
Martin Grotzke
http://www.javakaffee.de/blog/

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ