Hi Jeff,

Have you seen PositionFilterFactory? <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory>

Steve

> -----Original Message-----
> From: Jeff Rose [mailto:j...@globalorange.nl]
> Sent: Thursday, September 02, 2010 9:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
>
> On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir <rcm...@gmail.com> wrote:
>
> > On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <j...@globalorange.nl> wrote:
> >
> > > Hi,
> > > We are using SOLR to match query strings with a keyword database, where
> > > some of the keywords are actually more than one word. For example a
> > > keyword might be "apple pie" and we only want it to match for a query
> > > containing that word pair, but not one only containing "apple". Here is
> > > the relevant piece of the schema.xml, defining the index and query
> > > pipelines:
> > >
> > > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.TrimFilterFactory" />
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.TrimFilterFactory" />
> > >     <filter class="solr.ShingleFilterFactory" />
> > >   </analyzer>
> > > </fieldType>
> > >
> > > In the analysis tool this schema looks like it works correctly. Our
> > > multi-word keywords are indexed as a single entry, and then when a search
> > > phrase contains one of these multi-word keywords it is shingled and
> > > matched.
> > > Unfortunately, when we do the same queries on top of the actual index it
> > > responds with zero matches. I can see in the index histogram that the
> > > terms are correctly indexed from our mysql datasource containing the
> > > keywords, but somehow the shingling doesn't appear to work on this live
> > > data. Does anyone have experience with shingling that might have some
> > > tips for us, or otherwise advice for debugging the issue?
> > >
> >
> > query-time shingling probably isnt working with the queryparser you are
> > using, the default lucene one first splits on whitespace before sending it
> > to the analyzer: e.g. a query of foo bar is processed as TokenStream(foo) +
> > TokenStream(bar)
> >
> > so query-time shingling like this doesn't work as you expect for this
> > reason.
>
> Hi Robert, thanks for the response. I've looked into the query parsers a
> bit and I did find that using the raw parser on a matching multi-word
> keyword works correctly. I need to have shingling though, in order to
> support query phrases. It seems odd to have the query parser emitting
> tokens though. If this is the case why would we ever use the
> WhitespaceTokenizer? Either way, do you know what the correct configuration
> should be to actually perform shingling as it is documented to work: joining
> adjacent tokens into a single search term? (e.g. "apple" "pie" should
> become "apple pie")
>
> Thanks a lot for the help.
>
> -Jeff
>
> P.S. Markus, putting double quotes around the query doesn't seem to have any
> effect. It would be nice to have the analysis debug output on the actual
> queries so that I could see what is being searched for after analysis...
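
For anyone following the thread, here is a rough Python simulation of the behaviour Robert describes: a shingle filter can only join tokens that arrive in the same token stream, so a parser that splits on whitespace before calling the analyzer never gives it a pair to join. This is not Solr or Lucene code. The function names (shingles, analyze, parse_then_analyze) are made up for illustration, and the shingle size of 2 with unigrams kept is an assumption about the filter's defaults.

```python
def shingles(tokens, max_size=2):
    """Return the unigrams plus every run of adjacent tokens up to max_size."""
    out = set()
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.add(" ".join(tokens[i:i + size]))
    return out


def analyze(text):
    # Hypothetical stand-in for the query analyzer chain:
    # whitespace tokenize, lowercase, then shingle.
    return shingles(text.lower().split())


def parse_then_analyze(query):
    # Hypothetical stand-in for a query parser that splits on whitespace
    # first and hands each chunk to the analyzer separately.
    terms = set()
    for chunk in query.split():
        terms |= analyze(chunk)
    return terms


query = "Apple Pie recipe"
whole = analyze(query)                    # analyzer sees the full string
split_first = parse_then_analyze(query)   # analyzer sees one word at a time

print("apple pie" in whole)        # shingle is produced
print("apple pie" in split_first)  # shingle is never produced
```

In the first case the analyzer sees all three words and emits "apple pie" as a term, which is roughly what the analysis tool shows. In the second case each call receives a single word, so only the unigrams come out and the indexed keyword "apple pie" is never searched for.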