Hi Dennis,

I took a stab at answering this question in the following java-user mailing list post:
http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes

Steve

> -----Original Message-----
> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> Sent: Friday, September 03, 2010 5:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
>
> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?
>
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
> otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:
>
> > From: Jeff Rose <j...@globalorange.nl>
> > Subject: Re: shingles work in analyzer but not real data
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 3, 2010, 1:48 AM
> >
> > Thanks Steven and Jonathan, we got it working by using a combination of
> > quoting and the PositionFilterFactory, like is shown below. The
> > documentation for the position filter doesn't make much sense without
> > understanding more about how positioning of tokens is taken into account,
> > but it appears to do the trick. Does anyone know why position would matter
> > here? It seems like tokens would be emitted by a tokenizer, filtered,
> > joined into pairwise tokens by the shingler, and then matched against the
> > index. If position information is also important it seems odd that this is
> > not discussed in the documentation.. (Same for the pre-tokenizing done by
> > the query parser, before handing phrases to the tokenizer...)
> >
> > Anyway, here is our final schema that works as long as we put search phrases
> > in double quotes. Thanks for all the help!
> >
> > -Jeff
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.TrimFilterFactory" />
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> >          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.TrimFilterFactory" />
> >     <filter class="solr.ShingleFilterFactory"/>
> >     <filter class="solr.PositionFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> >
> > > I've run into this before too. Both the dismax and solr-lucene _query
> > > parsers_ will tokenize a query on whitespace _before_ they pass the query to
> > > any field analyzers.
> > > There are some reasons for this, lots of things wouldn't work if they
> > > didn't do this.
> > >
> > > But it makes your approach kind of hard. Try doing your search as a phrase
> > > search with double quotes, "apple pie", I bet it'll work then -- because
> > > both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> > > the stuff inside there before it gets to the field analyzers.
> > >
> > > So if non-tokenized fields like this are all that are included in your
> > > search, and if you can get your client application to just force phrase
> > > quoting of everything before sending to Solr, that might work. Otherwise....
> > > I don't know of a good solution. If you figure one out, let me know.
> > >
> > > Jonathan
> > >
> > >
> > > Jeff Rose wrote:
> > >
> > >> Hi,
> > >> We are using SOLR to match query strings with a keyword database, where
> > >> some of the keywords are actually more than one word. For example a keyword
> > >> might be "apple pie" and we only want it to match for a query containing
> > >> that word pair, but not one only containing "apple". Here is the relevant
> > >> piece of the schema.xml, defining the index and query pipelines:
> > >>
> > >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >>   <analyzer type="index">
> > >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>     <filter class="solr.TrimFilterFactory" />
> > >>   </analyzer>
> > >>   <analyzer type="query">
> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>     <filter class="solr.TrimFilterFactory" />
> > >>     <filter class="solr.ShingleFilterFactory" />
> > >>   </analyzer>
> > >> </fieldType>
> > >>
> > >> In the analysis tool this schema looks like it works correctly. Our
> > >> multi-word keywords are indexed as a single entry, and then when a search
> > >> phrase contains one of these multi-word keywords it is shingled and
> > >> matched.
> > >> Unfortunately, when we do the same queries on top of the actual index it
> > >> responds with zero matches. I can see in the index histogram that the terms
> > >> are correctly indexed from our mysql datasource containing the keywords, but
> > >> somehow the shingling doesn't appear to work on this live data. Does anyone
> > >> have experience with shingling that might have some tips for us, or
> > >> otherwise advice for debugging the issue?
> > >>
> > >> Thanks,
> > >> Jeff
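
For anyone reading this thread later, here is a rough Python sketch of what a shingle is (a token n-gram, here a word pair) and why the quoted and unquoted queries behave differently. It is a toy model, not Solr's code: the function names `shingles`, `analyze`, `parse_unquoted`, and `parse_quoted` are made up for illustration, and it assumes the behaviour described above, namely that the query parser splits unquoted input on whitespace before the field analyzer sees it, and that the shingle filter emits unigrams plus two-word shingles.

```python
def shingles(tokens, max_size=2, output_unigrams=True):
    """Return the tokens plus word n-grams (shingles) of 2..max_size tokens."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Toy query analyzer: split on whitespace, lowercase, then shingle."""
    return shingles(text.lower().split())


def parse_unquoted(query):
    """Assumed parser behaviour: split on whitespace BEFORE analysis,
    so the analyzer only ever sees one word at a time."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


def parse_quoted(query):
    """Assumed phrase behaviour: the whole quoted string reaches the analyzer."""
    return analyze(query.strip('"'))


# Keywords are indexed whole (the index tokenizer only splits on ';').
indexed_keywords = {"apple pie"}

for label, terms in [
    ("unquoted", parse_unquoted("apple pie")),
    ("quoted", parse_quoted('"apple pie"')),
]:
    matches = indexed_keywords & set(terms)
    print(label, terms, "matches:", sorted(matches))
```

In this model the unquoted query never produces the term "apple pie", because each word is analyzed on its own and there is nothing to pair it with, while the quoted query does. That would be consistent with the analysis tool looking fine (it feeds the whole string to the analyzer) while live queries return nothing.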