Re: EdgeNGram relevancy

Robert Gründler Thu, 11 Nov 2010 10:08:30 -0800

Thanks a lot, that setup works pretty well now.

The only problem now is that the stop words no longer work as well as before. I'll 
provide an example, but first the two field types:


  <!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>
  
  <!-- autocomplete field which finds "startsWith" matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
  </fieldType>


This setup now causes trouble with stop words. Here's an example:

Let's say the index contains two strings: "Mr Martin Scorsese" and "Martin 
Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result I get is "Mr Martin Scorsese", because the strict 
field edgytext2 is boosted by 2.0.

Any idea why in this case "Martin Scorsese" is not in the result at all?
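
To make it easier to see which tokens each field produces, here is a rough Python sketch of the two analyzer chains above. It is a simplification written for illustration, not Solr code: the function names are made up, the stopword set {"mr"} is assumed from the example, and it only models analysis, not how the query parser splits the query string or scores matches.

```python
import re

# Assumption: stopwords.txt contains "mr" (taken from the example above).
STOPWORDS = {"mr"}


def edge_ngrams(token, min_size=1, max_size=25):
    """Front-edge n-grams, like EdgeNGramFilterFactory with minGramSize=1, maxGramSize=25."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]


def strip_non_letters(token):
    """Mimics PatternReplaceFilterFactory pattern="([^a-z])" replacement="" replace="all"."""
    return re.sub(r"[^a-z]", "", token)


def edgytext(text, index=True):
    """Whitespace tokenizer -> lowercase -> stop filter -> pattern replace -> (index only) edge n-grams."""
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [strip_non_letters(t) for t in tokens]
    if index:
        return [gram for t in tokens for gram in edge_ngrams(t)]
    return tokens


def edgytext2(text, index=True):
    """Keyword tokenizer (whole input is one token) -> lowercase -> pattern replace -> (index only) edge n-grams.
    Note there is no stop filter in this chain."""
    token = strip_non_letters(text.lower())
    return edge_ngrams(token) if index else [token]


if __name__ == "__main__":
    for doc in ("Mr Martin Scorsese", "Martin Scorsese"):
        print(doc, "-> longest edgytext2 gram:", edgytext2(doc)[-1])
    print("edgytext query tokens for 'Mr':", edgytext("Mr", index=False))
    print("edgytext2 query tokens for 'Mr':", edgytext2("Mr", index=False))
```

In this sketch, edgytext2 indexes "Mr Martin Scorsese" as prefixes of "mrmartinscorsese" and "Martin Scorsese" as prefixes of "martinscorsese". Since edgytext2 has no StopFilterFactory, a query term "mr" on that field survives analysis and can only prefix-match the first string, while on edgytext the same term is removed as a stop word.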


thanks again!


-robert






On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

> You can add an additional field that uses KeywordTokenizerFactory instead 
> of WhitespaceTokenizerFactory, and query both fields with an OR 
> operator.
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply a boost so that begins-with matches come first.
> 
> --- On Thu, 11/11/10, Robert Gründler <rob...@dubture.com> wrote:
> 
>> From: Robert Gründler <rob...@dubture.com>
>> Subject: EdgeNGram relevancy
>> To: solr-user@lucene.apache.org
>> Date: Thursday, November 11, 2010, 5:51 PM
>> Hi,
>> 
>> consider the following fieldtype (used for
>> autocompletion):
>> 
>>   <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer type="index">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>>       <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
>>       <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
>>     </analyzer>
>>     <analyzer type="query">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>>       <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
>>     </analyzer>
>>   </fieldType>
>> 
>> 
>> This works fine as long as the query string is a single
>> word. For multiple words, the ranking is weird though.
>> 
>> Example:
>> 
>> Query String: "Bill Cl"
>> 
>> Result (in that order):
>> 
>> - Clyde Phillips
>> - Clay Rogers
>> - Roger Cloud
>> - Bill Clinton
>> 
>> "Bill Clinton" should have the highest rank in that
>> case.  
>> 
>> Does anyone have an idea how to configure this fieldtype to
>> make matches in both tokens rank higher than those that match
>> in only one token?
>> 
>> 
>> thanks!
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 
> 
> 
>
