RE: shingles work in analyzer but not real data

Steven A Rowe Fri, 03 Sep 2010 06:21:03 -0700

Hi Dennis,

I took a stab at answering this question in the following java-user mailing 
list post:


http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes

Steve

> -----Original Message-----
> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> Sent: Friday, September 03, 2010 5:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
> 
> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?
> 
> 
> Dennis Gearon
> 
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>   otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
> 
> 
> --- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:
> 
> > From: Jeff Rose <j...@globalorange.nl>
> > Subject: Re: shingles work in analyzer but not real data
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 3, 2010, 1:48 AM
> > Thanks Steven and Jonathan, we got it working by using a combination of
> > quoting and the PositionFilterFactory, as shown below.  The
> > documentation for the position filter doesn't make much sense without
> > understanding more about how positioning of tokens is taken into account,
> > but it appears to do the trick.  Does anyone know why position would matter
> > here?  It seems like tokens would be emitted by a tokenizer, filtered,
> > joined into pairwise tokens by the shingler, and then matched against the
> > index.  If position information is also important, it seems odd that this is
> > not discussed in the documentation.  (Same for the pre-tokenizing done by
> > the query parser, before handing phrases to the tokenizer...)
> >
> > Anyway, here is our final schema that works as long as we put search phrases
> > in double quotes.  Thanks for all the help!
> >
> > -Jeff
> >
> >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.TrimFilterFactory" />
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.TrimFilterFactory" />
> >         <filter class="solr.ShingleFilterFactory"/>
> >         <filter class="solr.PositionFilterFactory"/>
> >       </analyzer>
> >     </fieldType>
> >
> >
> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu>
> > wrote:
> >
> > > I've run into this before too. Both the dismax and solr-lucene _query
> > > parsers_ will tokenize a query on whitespace _before_ they pass the query to
> > > any field analyzers.
> > > There are some reasons for this; lots of things wouldn't work if they
> > > didn't do this.
> > >
> > > But it makes your approach kind of hard. Try doing your search as a phrase
> > > search with double quotes, "apple pie", I bet it'll work then -- because
> > > both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> > > the stuff inside there before it gets to the field analyzers.
> > >
> > > So if non-tokenized fields like this are all that are included in your
> > > search, and if you can get your client application to just force phrase
> > > quoting of everything before sending to Solr, that might work. Otherwise....
> > > I don't know of a good solution. If you figure one out, let me know.
> > >
> > > Jonathan
> > >
> > >
> > > Jeff Rose wrote:
> > >
> > >> Hi,
> > >>  We are using SOLR to match query strings with a keyword database, where
> > >> some of the keywords are actually more than one word.  For example a
> > >> keyword might be "apple pie" and we only want it to match for a query
> > >> containing that word pair, but not one only containing "apple".  Here is
> > >> the relevant piece of the schema.xml, defining the index and query pipelines:
> > >>
> > >>  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >>     <analyzer type="index">
> > >>       <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>       <filter class="solr.TrimFilterFactory" />
> > >>     </analyzer>
> > >>     <analyzer type="query">
> > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>       <filter class="solr.TrimFilterFactory" />
> > >>       <filter class="solr.ShingleFilterFactory" />
> > >>     </analyzer>
> > >>  </fieldType>
> > >>
> > >> In the analysis tool this schema looks like it works correctly.  Our
> > >> multi-word keywords are indexed as a single entry, and then when a search
> > >> phrase contains one of these multi-word keywords it is shingled and
> > >> matched.
> > >>  Unfortunately, when we do the same queries on top of the actual index it
> > >> responds with zero matches.  I can see in the index histogram that the
> > >> terms are correctly indexed from our mysql datasource containing the
> > >> keywords, but somehow the shingling doesn't appear to work on this live
> > >> data.  Does anyone have experience with shingling that might have some
> > >> tips for us, or otherwise advice for debugging the issue?
> > >>
> > >> Thanks,
> > >> Jeff
> > >>
> > >>
> > >>
> > >
> >
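
For anyone reading this thread later, here is a rough Python sketch of the behaviour Jonathan describes. It is not Solr code and not anyone's actual implementation: shingles(), analyze_query_text() and parse_query() are hypothetical stand-ins, the shingler is assumed to emit unigrams plus two-word shingles, and token positions are ignored entirely. It only illustrates why an unquoted query never produces the "apple pie" shingle while a quoted one does.

```python
import re


def shingles(tokens, max_size=2):
    """Return each token plus the word n-grams (shingles) that start at it."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze_query_text(text):
    """Stand-in for the query analyzer: split on [.,?;: !], lowercase, trim, shingle."""
    tokens = [t.strip().lower() for t in re.split(r"[.,?;: !]", text) if t.strip()]
    return shingles(tokens)


def parse_query(query):
    """Stand-in for the query parser (assumption): quoted phrases are passed to
    the analyzer whole, everything else is split on whitespace first."""
    terms = []
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        terms.extend(analyze_query_text(phrase or word))
    return terms


# The index side keeps each ';'-separated keyword as a single term.
indexed_keywords = {"apple pie", "cheese"}

unquoted = parse_query("apple pie")    # analyzer sees "apple" and "pie" separately
quoted = parse_query('"apple pie"')    # analyzer sees "apple pie" as one chunk

print(unquoted, indexed_keywords.intersection(unquoted))
print(quoted, indexed_keywords.intersection(quoted))
```

In this model the unquoted query is split before the shingler runs, so the shingler only ever sees one token at a time and has nothing to pair up. The analysis tool feeds the whole string to the analyzer, which would explain why it looked correct there.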
