http://en.wikipedia.org/wiki/W-shingling
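
In case a concrete picture helps: a shingle is a run of adjacent tokens (a word n-gram) emitted as a single token. The sketch below is plain Python, not Lucene's ShingleFilter implementation. The function name `shingles` and its parameters are made up for illustration and only loosely mirror the maxShingleSize and outputUnigrams attributes seen in the schema quoted below.

```python
def shingles(tokens, max_size=2, output_unigrams=True):
    """Return word n-grams (shingles) of size 1..max_size, in token order.

    Illustrative only: a hypothetical helper, not Lucene's ShingleFilter.
    """
    smallest = 1 if output_unigrams else 2
    out = []
    for i in range(len(tokens)):
        for n in range(smallest, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


# The analyzer sees the whole phrase, so the pair "apple pie" is produced.
whole_phrase = shingles(["hot", "apple", "pie"])
print(whole_phrase)
# ['hot', 'hot apple', 'apple', 'apple pie', 'pie']

# If something splits the query on whitespace first and analyzes each
# word separately, no pair can ever be formed.
pre_split = [s for word in "apple pie".split() for s in shingles([word])]
print(pre_split)
# ['apple', 'pie']
```

The second call models, under the assumption described further down the thread, a query parser that splits on whitespace before the field analyzer runs: each word is shingled alone, so "apple pie" is never emitted unless the phrase is quoted.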

On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe <sar...@syr.edu> wrote:
> Hi Dennis,
>
> I took a stab at answering this question in the following java-user mailing 
> list post:
>
> http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
>
> Steve
>
>> -----Original Message-----
>> From: Dennis Gearon [mailto:gear...@sbcglobal.net]
>> Sent: Friday, September 03, 2010 5:06 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: shingles work in analyzer but not real data
>>
>> Anyone got a definitive, authoritative link to the definition of a
>> 'shingle' in search engine results/technology?
>>
>>
>> Dennis Gearon
>>
>> Signature Warning
>> ----------------
>> EARTH has a Right To Life,
>>   otherwise we all die.
>>
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>>
>>
>> --- On Fri, 9/3/10, Jeff Rose <j...@globalorange.nl> wrote:
>>
>> > From: Jeff Rose <j...@globalorange.nl>
>> > Subject: Re: shingles work in analyzer but not real data
>> > To: solr-user@lucene.apache.org
>> > Date: Friday, September 3, 2010, 1:48 AM
>> > Thanks Steven and Jonathan, we got it working by using a combination of
>> > quoting and the PositionFilterFactory, as shown below.  The documentation
>> > for the position filter doesn't make much sense without understanding
>> > more about how the positioning of tokens is taken into account, but it
>> > appears to do the trick.  Does anyone know why position would matter
>> > here?  It seems like tokens would be emitted by a tokenizer, filtered,
>> > joined into pairwise tokens by the shingler, and then matched against the
>> > index.  If position information is also important, it seems odd that this
>> > is not discussed in the documentation.  (Same for the pre-tokenizing done
>> > by the query parser, before handing phrases to the tokenizer...)
>> >
>> > Anyway, here is our final schema that works as long as we put search
>> > phrases in double quotes.  Thanks for all the help!
>> >
>> > -Jeff
>> >
>> >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >    <analyzer type="index">
>> >      <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> >      <filter class="solr.LowerCaseFilterFactory"/>
>> >      <filter class="solr.TrimFilterFactory"/>
>> >      <filter class="solr.LowerCaseFilterFactory"/>
>> >      <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>> >           outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
>> >    </analyzer>
>> >    <analyzer type="query">
>> >      <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
>> >      <filter class="solr.LowerCaseFilterFactory"/>
>> >      <filter class="solr.TrimFilterFactory"/>
>> >      <filter class="solr.ShingleFilterFactory"/>
>> >      <filter class="solr.PositionFilterFactory"/>
>> >    </analyzer>
>> >  </fieldType>
>> >
>> >
>> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochk...@jhu.edu>
>> > wrote:
>> >
>> > > I've run into this before too. Both the dismax and solr-lucene _query
>> > > parsers_ will tokenize a query on whitespace _before_ they pass the
>> > > query to any field analyzers.
>> > > There are some reasons for this, lots of things wouldn't work if they
>> > > didn't do this.
>> > >
>> > > But it makes your approach kind of hard. Try doing your search as a
>> > > phrase search with double quotes, "apple pie", I bet it'll work then --
>> > > because both dismax and solr-lucene will respect the phrase quotes and
>> > > NOT tokenize the stuff inside there before it gets to the field
>> > > analyzers.
>> > >
>> > > So if non-tokenized fields like this are all that are included in your
>> > > search, and if you can get your client application to just force phrase
>> > > quoting of everything before sending to Solr, that might work.
>> > > Otherwise.... I don't know of a good solution. If you figure one out,
>> > > let me know.
>> > >
>> > > Jonathan
>> > >
>> > >
>> > > Jeff Rose wrote:
>> > >
>> > >> Hi,
>> > >>  We are using SOLR to match query strings with a keyword database,
>> > >> where some of the keywords are actually more than one word.  For
>> > >> example a keyword might be "apple pie" and we only want it to match
>> > >> for a query containing that word pair, but not one only containing
>> > >> "apple".  Here is the relevant piece of the schema.xml, defining the
>> > >> index and query pipelines:
>> > >>
>> > >>  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> > >>    <analyzer type="index">
>> > >>      <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> > >>      <filter class="solr.LowerCaseFilterFactory"/>
>> > >>      <filter class="solr.TrimFilterFactory"/>
>> > >>    </analyzer>
>> > >>    <analyzer type="query">
>> > >>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > >>      <filter class="solr.LowerCaseFilterFactory"/>
>> > >>      <filter class="solr.TrimFilterFactory"/>
>> > >>      <filter class="solr.ShingleFilterFactory"/>
>> > >>    </analyzer>
>> > >>  </fieldType>
>> > >>
>> > >> In the analysis tool this schema looks like it works correctly.  Our
>> > >> multi-word keywords are indexed as a single entry, and then when a
>> > >> search phrase contains one of these multi-word keywords it is shingled
>> > >> and matched.
>> > >>  Unfortunately, when we do the same queries on top of the actual index
>> > >> it responds with zero matches.  I can see in the index histogram that
>> > >> the terms are correctly indexed from our mysql datasource containing
>> > >> the keywords, but somehow the shingling doesn't appear to work on this
>> > >> live data.  Does anyone have experience with shingling that might have
>> > >> some tips for us, or otherwise advice for debugging the issue?
>> > >>
>> > >> Thanks,
>> > >> Jeff
>> > >>
>> > >>
>> > >>
>> > >
>> >
>



-- 
Lance Norskog
goks...@gmail.com
