On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
Hi Jason,
what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume that this is because suggestions are made for
each word separately, and this does not require that both/all
suggestions are contained in the same document.
Yeah, the SpellCheckComponent is not phrase aware. My guess would be
that you would somehow need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent)
that preserved phrases as a single token. Likewise, you would need
that on your indexing side as well for the spell checker. In short, I
suppose it's possible, but it would be work. You probably could use
the shingle filter (token based n-grams).
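A rough, untested sketch of what a shingle-based field type might look
like, assuming your Solr version ships a solr.ShingleFilterFactory (the
type name and maxShingleSize value here are made up for illustration):

<fieldtype name="textSpellShingle" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

With outputUnigrams="true" you'd get both single words and two-word
shingles ("harry", "potter", "harry potter") into the spelling field, so
phrase-level suggestions at least become possible.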
Alternatively, by using extendedResults, you can get back the
frequency of each of the words, and then you could decide whether the
collation is going to have any results assuming they are all or'd
together. For phrases and AND queries, I'm not sure. It's doable,
I'm sure, but it would be a lot more involved.
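For example, an extended-results request might look something like this
(the /select handler and localhost URL are just the defaults; adjust for
your setup):

http://localhost:8983/solr/select?q=harry+poter
    &spellcheck=true
    &spellcheck.extendedResults=true
    &spellcheck.collate=true

The extended results then report a frequency for each suggested word,
which is what you'd use to judge whether an OR'd collation can match
anything at all.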
I wonder what's the standard approach for searches with multiple
words.
Are these working ok for you?
Cheers,
Martin
On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
Hi Martin,
I'm a relative newbie to solr, have been playing with the spellcheck
component and seem to have it working. I certainly can't explain what
all is going on, but with any luck, I can help you get the spellchecker
up and running. Additional replies in-lined below.
On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[EMAIL PROTECTED]> wrote:
Now I'm thinking about the source-field in the spellchecker
("spell"):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured?
I followed the conventions in the default solrconfig.xml and
schema.xml
files. So I created a "textSpell" field type (schema.xml):
<!-- field type for the spell checker which doesn't stem -->
<fieldtype name="textSpell" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
and used this for the queryAnalyzerFieldType. I also created a
spellField to store the text I want to spell check against and used the
same analyzer (figuring that the query and indexed data should be
analyzed the same way) (schema.xml):

<!-- Spell check field -->
<field name="spellField" type="textSpell" indexed="true"
    stored="true" />
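For completeness, here's roughly how the two get wired together in
solrconfig.xml (this mirrors the SpellCheckComponent wiki page; the
index dir path is just an example):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellField</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>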
If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
field "brand") directly to the "spell" field. The "spell" field is of
type "string".
We're copying description to spellField. I'd recommend using a type like
the above textSpell type, since "The StringField type is not analyzed,
but indexed/stored verbatim" (schema.xml):

<copyField source="description" dest="spellField" />
Other fields like e.g. the product title I would first copy to some
whitespace-tokenized field (field type with WhitespaceTokenizerFactory)
and afterwards to the "spell" field. The product title might be e.g.
"Canon EOS 450D EF-S 18-55 mm".
Hmm... I'm not sure if this would work, as I don't think the analyzer is
applied until after the copy is made. FWIW, I've had trouble copying
multiple fields to spellField (i.e. adding a second copyField w/
dest="spellField"), so we just index the spellchecker on a single
field...
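One thing that might be worth trying, though I haven't verified it
myself: declaring spellField as multiValued so that several copyField
directives can target it, e.g.

<field name="spellField" type="textSpell" indexed="true"
    stored="true" multiValued="true"/>
<copyField source="description" dest="spellField"/>
<copyField source="title" dest="spellField"/>

(the "title" source field is hypothetical; substitute whatever holds
your product titles).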
Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?
I think if you use the same analyzer for indexing and queries, the
distinction probably isn't tremendously important. When I went
searching, it looked like the StandardTokenizer split on non-letters.
I'd guess the rationale for using the StandardTokenizer is that it won't
recommend non-letter characters. I was seeing some weirdness earlier (no
inserts/deletes), but that disappeared now that I'm using the
StandardTokenizer.
Cheers,
Jason
--
Martin Grotzke
http://www.javakaffee.de/blog/
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ