http://en.wikipedia.org/wiki/W-shingling
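
To make that definition concrete for the thread quoted below, here is a minimal Python sketch of word shingling. It is a toy, not Solr's code: `shingles` and `analyze` are made-up helper names, and it emits only two-word shingles, whereas ShingleFilterFactory (as far as I know) also emits the single tokens by default. The second half mimics what Jonathan describes: if the query parser splits on whitespace before the analyzer runs, the shingler only ever sees one word at a time and can never build "apple pie".

```python
def shingles(tokens, size=2):
    """Return every run of `size` adjacent tokens, joined by a single space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def analyze(text, size=2):
    """Toy query analyzer (hypothetical): lowercase, split on whitespace, shingle."""
    return shingles(text.lower().split(), size)


query = "hot apple pie"

# Case 1: the whole phrase reaches the analyzer (the double-quoted case).
whole = analyze(query)

# Case 2: the query is split on whitespace first, so the analyzer is
# called once per word and never has two adjacent tokens to join.
pre_split = [s for chunk in query.split() for s in analyze(chunk)]

print(whole)
print(pre_split)
```

In the first case the word pair "apple pie" is among the shingles and can match a keyword indexed as a single term; in the second case no word pairs are produced at all.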
On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe <sar...@syr.edu> wrote:
> Hi Dennis,
>
> I took a stab at answering this question in the following java-user mailing
> list post:
>
> http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
>
> Steve
>
>> -----Original Message-----
>> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
>> Sent: Friday, September 03, 2010 5:06 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: shingles work in analyzer but not real data
>>
>> Anyone got a definitive, authoritative link to the definition of a
>> 'shingle' in search engine results/technology?
>>
>>
>> Dennis Gearon
>>
>> Signature Warning
>> ----------------
>> EARTH has a Right To Life,
>> otherwise we all die.
>>
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>>
>>
>> --- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:
>>
>> > From: Jeff Rose <j...@globalorange.nl>
>> > Subject: Re: shingles work in analyzer but not real data
>> > To: solr-user@lucene.apache.org
>> > Date: Friday, September 3, 2010, 1:48 AM
>> >
>> > Thanks Steven and Jonathan, we got it working by using a combination of
>> > quoting and the PositionFilterFactory, like is shown below. The
>> > documentation for the position filter doesn't make much sense without
>> > understanding more about how positioning of tokens is taken into account,
>> > but it appears to do the trick. Does anyone know why position would matter
>> > here? It seems like tokens would be emitted by a tokenizer, filtered,
>> > joined into pairwise tokens by the shingler, and then matched against the
>> > index. If position information is also important it seems odd that this is
>> > not discussed in the documentation.. (Same for the pre-tokenizing done by
>> > the query parser, before handing phrases to the tokenizer...)
>> >
>> > Anyway, here is our final schema that works as long as we put search
>> > phrases in double quotes. Thanks for all the help!
>> >
>> > -Jeff
>> >
>> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory" />
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>> >          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
>> >   </analyzer>
>> >   <analyzer type="query">
>> >     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory" />
>> >     <filter class="solr.ShingleFilterFactory"/>
>> >     <filter class="solr.PositionFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> >
>> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu>
>> > wrote:
>> >
>> > > I've run into this before too. Both the dismax and solr-lucene _query
>> > > parsers_ will tokenize a query on whitespace _before_ they pass the
>> > > query to any field analyzers.
>> > > There are some reasons for this, lots of things wouldn't work if they
>> > > didn't do this.
>> > >
>> > > But it makes your approach kind of hard. Try doing your search as a
>> > > phrase search with double quotes, "apple pie", I bet it'll work then --
>> > > because both dismax and solr-lucene will respect the phrase quotes and
>> > > NOT tokenize the stuff inside there before it gets to the field
>> > > analyzers.
>> > >
>> > > So if non-tokenized fields like this are all that are included in your
>> > > search, and if you can get your client application to just force phrase
>> > > quoting of everything before sending to Solr, that might work.
>> > > Otherwise.... I don't know of a good solution. If you figure one out,
>> > > let me know.
>> > >
>> > > Jonathan
>> > >
>> > >
>> > > Jeff Rose wrote:
>> > >
>> > >> Hi,
>> > >> We are using SOLR to match query strings with a keyword database,
>> > >> where some of the keywords are actually more than one word. For
>> > >> example a keyword might be "apple pie" and we only want it to match
>> > >> for a query containing that word pair, but not one only containing
>> > >> "apple". Here is the relevant piece of the schema.xml, defining the
>> > >> index and query pipelines:
>> > >>
>> > >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> > >>   <analyzer type="index">
>> > >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> > >>     <filter class="solr.LowerCaseFilterFactory"/>
>> > >>     <filter class="solr.TrimFilterFactory" />
>> > >>   </analyzer>
>> > >>   <analyzer type="query">
>> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > >>     <filter class="solr.LowerCaseFilterFactory"/>
>> > >>     <filter class="solr.TrimFilterFactory" />
>> > >>     <filter class="solr.ShingleFilterFactory" />
>> > >>   </analyzer>
>> > >> </fieldType>
>> > >>
>> > >> In the analysis tool this schema looks like it works correctly. Our
>> > >> multi-word keywords are indexed as a single entry, and then when a
>> > >> search phrase contains one of these multi-word keywords it is shingled
>> > >> and matched.
>> > >> Unfortunately, when we do the same queries on top of the actual index
>> > >> it responds with zero matches. I can see in the index histogram that
>> > >> the terms are correctly indexed from our mysql datasource containing
>> > >> the keywords, but somehow the shingling doesn't appear to work on this
>> > >> live data. Does anyone have experience with shingling that might have
>> > >> some tips for us, or otherwise advice for debugging the issue?
>> > >>
>> > >> Thanks,
>> > >> Jeff
>> > >>
>> > >
>> >

--
Lance Norskog
goks...@gmail.com