Re: EdgeNGram relevancy

Andy Thu, 11 Nov 2010 13:18:54 -0800

Ah I see. Thanks for the explanation.

Could you set the defaultOperator to "AND"? That way both "Bill" and "Cl" must 
be a match and that would exclude "Clyde Phillips".
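
To illustrate the difference, here is a rough Python sketch of the analysis chain described in the quoted message below. It is not Solr code: the function names are made up for illustration, and it models only whitespace tokenizing, lowercasing and edge n-grams (the stop filter and pattern replace filter are left out). Under those assumptions it shows which names match "Bill Cl" with OR versus AND semantics.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    """Return all prefixes of token between min_gram and max_gram characters."""
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


def index_terms(text):
    """Index-time analysis (simplified): split on whitespace, lowercase, edge n-grams."""
    terms = set()
    for token in text.lower().split():
        terms |= edge_ngrams(token)
    return terms


def query_terms(text):
    """Query-time analysis (simplified): split on whitespace, lowercase, no n-grams."""
    return text.lower().split()


def matches(doc, query, operator="OR"):
    """True if the document matches the query under the given default operator."""
    indexed = index_terms(doc)
    hits = [term in indexed for term in query_terms(query)]
    return all(hits) if operator == "AND" else any(hits)


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
or_hits = [n for n in names if matches(n, "Bill Cl", "OR")]
and_hits = [n for n in names if matches(n, "Bill Cl", "AND")]

print(or_hits)   # every name matches, because each one has a token starting with "cl"
print(and_hits)  # only "Bill Clinton" contains both "bill" and "cl"
```

With OR, a single matching query token is enough, which is why "Clyde Phillips" shows up at all. With AND, only "Bill Clinton" remains.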



--- On Thu, 11/11/10, Robert Gründler <rob...@dubture.com> wrote:

> From: Robert Gründler <rob...@dubture.com>
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
> according to the fieldtype i posted
> previously, i think it's because of:
> 
> 1. WhitespaceTokenizer splits the String "Clyde Phillips"
> into 2 tokens: "Clyde" and "Phillips"
> 2. EdgeNGramFilter gets the 2 tokens, and creates an
> EdgeNGram for each token: "C" "Cl" "Cly"
> ...   AND  "P" "Ph" "Phi" ...
> 
> The Query String "Bill Cl" gets split up in 2 Tokens "Bill"
> and "Cl" by the WhitespaceTokenizer.
> 
> This creates a match between the 2nd token "Cl" of the query
> and one of the "sub"tokens the EdgeNGramFilter created:
> "Cl".
> 
> 
> -robert
> 
> 
> 
> 
> On Nov 11, 2010, at 21:34 , Andy wrote:
> 
> > Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?
> > 
> > "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?
> > 
> > Thanks.
> > 
> > --- On Thu, 11/11/10, Ahmet Arslan <iori...@yahoo.com> wrote:
> > 
> >> You can add an additional field that uses
> >> KeywordTokenizerFactory instead of
> >> WhitespaceTokenizerFactory, and query both fields with
> >> an OR operator.
> >> 
> >> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> >> 
> >> You can even apply a boost so that begins-with matches come
> >> first.
> >> 
> >> --- On Thu, 11/11/10, Robert Gründler <rob...@dubture.com> wrote:
> >> 
> >>> From: Robert Gründler <rob...@dubture.com>
> >>> Subject: EdgeNGram relevancy
> >>> To: solr-user@lucene.apache.org
> >>> Date: Thursday, November 11, 2010, 5:51 PM
> >>> Hi,
> >>> 
> >>> consider the following fieldtype (used for
> >>> autocompletion):
> >>> 
> >>>   <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >>>    <analyzer type="index">
> >>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> >>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
> >>>      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
> >>>    </analyzer>
> >>>    <analyzer type="query">
> >>>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> >>>      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
> >>>    </analyzer>
> >>>   </fieldType>
> >>> 
> >>> 
> >>> This works fine as long as the query string is a single
> >>> word. For multiple words, the ranking is weird though.
> >>> 
> >>> Example:
> >>> 
> >>> Query String: "Bill Cl"
> >>> 
> >>> Result (in that order):
> >>> 
> >>> - Clyde Phillips
> >>> - Clay Rogers
> >>> - Roger Cloud
> >>> - Bill Clinton
> >>> 
> >>> "Bill Clinton" should have the highest rank in that case.
> >>> 
> >>> Does anyone have an idea how to configure this fieldtype
> >>> so that matches in both tokens rank higher than those that
> >>> match in only one token?
> >>> 
> >>> 
> >>> thanks!
> >>> 
> >>> 
> >>> -robert
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> >> 
> >> 
> > 
> > 
> > 
> 
>
