Thanx for your help so far, I just wanted to post my results here... In short: I now use the ShingleFilter to create shingles when copying my fields into the field "spellMultiWords". For query time, I implemented a MultiWordSpellingQueryConverter that just leaves the query as is, so that there's only one token that is checked for spelling suggestions.
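Just to illustrate what ends up in the spellcheck field: with maxShingleSize="3" and outputUnigrams="true", a title like "Canon EOS 450D" is indexed as the single words plus all two- and three-word shingles. Here's a minimal standalone sketch of that analysis chain (the class name is mine, and the package locations of WhitespaceTokenizer and LowerCaseFilter have moved between Lucene releases, so the imports may need adjusting for your version):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleDemo {
    public static void main( String[] args ) throws IOException {
        // Same chain as the "textSpellMultiWords" field type below:
        // whitespace tokens -> lowercase -> shingles of up to 3 words.
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader( new StringReader( "Canon EOS 450D" ) );
        TokenStream ts = new ShingleFilter( new LowerCaseFilter( tokenizer ), 3 );
        CharTermAttribute term = ts.addAttribute( CharTermAttribute.class );
        ts.reset();
        while ( ts.incrementToken() ) {
            System.out.println( term );
        }
        ts.end();
        ts.close();
        // Prints: canon, "canon eos", "canon eos 450d", eos, "eos 450d", 450d
    }
}

A query like "canon eos" then matches the indexed bigram directly, which is exactly what the whole-query token produced by the QueryConverter below is compared against.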
Here's the detailed configuration:

= schema.xml =

<fieldType name="textSpellMultiWords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spellMultiWords" type="textSpellMultiWords" indexed="true" stored="true" multiValued="true"/>

<copyField source="name" dest="spellMultiWords" />
<copyField source="cat" dest="spellMultiWords" />
... and more ...

= solrconfig.xml =

<searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">
  <!-- this is not used at all, can probably be omitted -->
  <str name="queryAnalyzerFieldType">textSpellMultiWords</str>
  <lst name="spellchecker">
    <!-- Optional; required when more than one spellchecker is configured -->
    <str name="name">default</str>
    <str name="field">spellMultiWords</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
    <str name="accuracy">0.5</str>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spellMultiWords</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<queryConverter name="queryConverter" class="my.proj.solr.MultiWordSpellingQueryConverter"/>

= MultiWordSpellingQueryConverter =

package my.proj.solr;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * The whole query is kept as one single token, so that it is checked
     * against the shingles in the spellcheck index.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert( String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        final Token token = new Token( 0, original.length() );
        token.setTermBuffer( original );
        return Arrays.asList( token );
    }

}
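For reference, a request against this component then looks s.th. like this (host, port, and handler path are just my local setup; the spellcheckMultiWords component has to be registered in the request handler's last-components for this to work):

http://localhost:8983/solr/select?q=harry+poter&spellcheck=true&spellcheck.collate=true&spellcheck.dictionary=jarowinkler

Since the query converter keeps "harry poter" as one token, it is compared against whole shingles like "harry potter", so a collated suggestion is a phrase that really occurs in the index.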
There are some issues still to be resolved:
- terms are lowercased in the index, so some case restoration should happen
- we use stemming for our text field, so the spellchecker might suggest searches that lead to the same results (e.g. the german2 stemmer stems both "hose" and "hosen" to "hos", so "Hose" and "Hosen" give the same results)
- inconsistent/strange sorting of suggestions (as described in http://www.nabble.com/spellcheck%3A-issues-td19845539.html)

Cheers,
Martin

On Mon, 2008-10-06 at 22:45 +0200, Martin Grotzke wrote:
> On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote:
> > On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:
> > > Hi Jason,
> > >
> > > what about multi-word searches like "harry potter"? When I do a
> > > search in our index for "harry poter", I get the suggestion "harry
> > > spotter" (using spellcheck.collate=true and jarowinkler distance).
> > > Searching for "harry spotter" (we're searching AND, not OR) then
> > > gives no results. I assume that this is because suggestions are
> > > done for words separately, and this does not require that both/all
> > > suggestions are contained in the same document.
> >
> > Yeah, the SpellCheckComponent is not phrase aware. My guess would be
> > that you would somehow need a QueryConverter (see
> > http://wiki.apache.org/solr/SpellCheckComponent) that preserved
> > phrases as a single token. Likewise, you would need that on your
> > indexing side as well for the spell checker. In short, I suppose
> > it's possible, but it would be work. You probably could use the
> > shingle filter (token based n-grams).
>
> I also thought about s.th. like this, and also stumbled over the
> ShingleFilter :)
>
> So I would change the "spell" field to use the ShingleFilter?
>
> Did I understand the answer to the posting "chaining copyFields"
> correctly, that I cannot pipe the title through some "shingledTitle"
> field and copy it afterwards to the "spell" field (while other fields
> like brand are copied directly to the spell field)?
>
> Thanx && cheers,
> Martin
>
> > Alternatively, by using extendedResults, you can get back the
> > frequency of each of the words, and then you could decide whether
> > the collation is going to have any results, assuming they are all
> > or'd together. For phrases and AND queries, I'm not sure. It's
> > doable, I'm sure, but it would be a lot more involved.
> >
> > > I wonder what's the standard approach for searches with multiple
> > > words. Are these working ok for you?
> > >
> > > Cheers,
> > > Martin
> > >
> > > On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote:
> > > > Hi Martin,
> > > >
> > > > I'm a relative newbie to solr, have been playing with the
> > > > spellcheck component and seem to have it working. I certainly
> > > > can't explain what all is going on, but with any luck, I can
> > > > help you get the spellchecker up and running. Additional replies
> > > > in-lined below.
> > > >
> > > > On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Now I'm thinking about the source field in the spellchecker
> > > > > ("spell"): how should fields be analyzed during indexing, and
> > > > > how should the queryAnalyzerFieldType be configured.
> > > >
> > > > I followed the conventions in the default solrconfig.xml and
> > > > schema.xml files. So I created a "textSpell" field type
> > > > (schema.xml):
> > > >
> > > > <!-- field type for the spell checker which doesn't stem -->
> > > > <fieldtype name="textSpell" class="solr.TextField"
> > > >     positionIncrementGap="100">
> > > >   <analyzer>
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldtype>
> > > >
> > > > and used this for the queryAnalyzerFieldType. I also created a
> > > > spellField to store the text I want to spell check against and
> > > > used the same analyzer (figuring that the query and indexed data
> > > > should be analyzed the same way) (schema.xml):
> > > >
> > > > <!-- Spell check field -->
> > > > <field name="spellField" type="textSpell" indexed="true"
> > > >     stored="true" />
> > > >
> > > > > If I have brands like e.g. "Apple" or "Ed Hardy" I would copy
> > > > > them (the field "brand") directly to the "spell" field. The
> > > > > "spell" field is of type "string".
> > > >
> > > > We're copying description to spellField. I'd recommend using a
> > > > type like the above textSpell type since "The StringField type
> > > > is not analyzed, but indexed/stored verbatim" (schema.xml):
> > > >
> > > > <copyField source="description" dest="spellField" />
> > > >
> > > > > Other fields like e.g. the product title I would first copy
> > > > > to some whitespaceTokenized field (field type with
> > > > > WhitespaceTokenizerFactory) and afterwards to the "spell"
> > > > > field. The product title might be e.g. "Canon EOS 450D EF-S
> > > > > 18-55 mm".
> > > >
> > > > Hmm... I'm not sure if this would work, as I don't think the
> > > > analyzer is applied until after the copy is made. FWIW, I've had
> > > > trouble copying multiple fields to spellField (i.e. adding a
> > > > second copyField w/ dest="spellField"), so we just index the
> > > > spellchecker on a single field...
> > > >
> > > > > Shouldn't this be a WhitespaceTokenizerFactory, or is it
> > > > > better to use a StandardTokenizerFactory here?
> > > >
> > > > I think if you use the same analyzer for indexing and queries,
> > > > the distinction probably isn't tremendously important. When I
> > > > went searching, it looked like the StandardTokenizer split on
> > > > non-letters. I'd guess the rationale for using the
> > > > StandardTokenizer is that it won't recommend non-letter
> > > > characters. I was seeing some weirdness earlier (no
> > > > inserts/deletes), but that disappeared now that I'm using the
> > > > StandardTokenizer.
> > > >
> > > > Cheers,
> > > >
> > > > Jason
> > >
> > > --
> > > Martin Grotzke
> > > http://www.javakaffee.de/blog/
> >
> > --------------------------
> > Grant Ingersoll
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ

--
Martin Grotzke
http://www.javakaffee.de/blog/