Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

Steve Rowe Mon, 05 Feb 2018 08:27:34 -0800

Hi Александр,

> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <[email protected]> wrote:
> 
> There should be no problem with using them together.


I believe Shawn is wrong.

From 
<http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:

> NOTE: this cannot consume an incoming graph; results will be undefined.

Unfortunately, the ref guide entry for Synonym Graph Filter
<https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
doesn’t include a warning about this, but it should, like the warning on Word
Delimiter Graph Filter
<https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:

> Note: although this filter produces correct token graphs, it cannot consume 
> an input token graph correctly.

(I’ve just committed a change to the ref guide source to add this also on the 
Synonym Graph Filter and Managed Synonym Graph Filter entries, to be included 
in the ref guide for Solr 7.3.)

In short, the combination of the two filters is not supported, because WDGF 
produces a token graph, which SGF cannot correctly interpret.
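
To make "token graph" concrete, here is a minimal Python sketch (not Lucene
code). The names Token, arcs and is_graph are made up for illustration, and the
WDGF output for "b2" is assumed from the description of your analysis screen
rather than captured from Solr. Node numbers start at 0 here, whereas the Solr
analysis screen shows 1-based positions.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # number of positions this token spans


def arcs(tokens):
    """Convert a token stream into (from_node, to_node, term) arcs."""
    result = []
    pos = -1
    for t in tokens:
        pos += t.pos_inc
        result.append((pos, pos + t.pos_len, t.term))
    return result


def is_graph(tokens):
    """A stream is a real graph if any token spans more than one position."""
    return any(t.pos_len > 1 for t in tokens)


# Assumed WDGF output for "b2" with preserveOriginal="1" (illustrative only):
# the original token spans both split parts.
wdgf_output = [Token("b2", 1, 2), Token("b", 0), Token("2", 1)]

for arc in arcs(wdgf_output):
    print(arc)
print("graph:", is_graph(wdgf_output))
```

In this model "b2" spans two positions, so the stream is a graph rather than a
flat sequence of tokens, which is the kind of input the SynonymGraphFilter
javadoc says it cannot consume.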

Other filters have the same limitation; see e.g. 
<https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter. That 
JIRA issue has gotten some attention recently, and hopefully it will inspire 
fixes elsewhere.

Patches welcome!

--
Steve
www.lucidworks.com


> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <[email protected]> wrote:
> 
> On 2/5/2018 3:55 AM, Александр Шестак wrote:
>> 
>> Hi, I am unsure about the usage of SynonymGraphFilterFactory
>> and WordDelimiterGraphFilterFactory. Can they be used together?
>> 
> 
> There should be no problem with using them together.  But it is always
> possible that the behavior will surprise you, while working 100% as
> designed.
> 
>> I have a Solr field type configured as follows:
>> 
>> <fieldtype name="fulltext_en" class="solr.TextField"
>>            autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>     <filter class="solr.FlattenGraphFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>             preserveOriginal="1" protected="protwords_en.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SynonymGraphFilterFactory"
>>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
>>   </analyzer>
>> </fieldtype>
>> 
>> So at query time it uses SynonymGraphFilterFactory after
>> WordDelimiterGraphFilterFactory.
>> Synonyms are configured as follows:
>> b=>b,boron
>> 2=>ii,2
>> 
>> In the Solr analysis tool, the terms after SGF are shown at
>> positions 3 and 4. Is that correct? I thought they should have
>> positions 1 and 2.
>> 
> 
> What matters is the *relative* positions.  The exact position number
> doesn't matter much.  Something new that the Graph implementations use
> is the position length.  That feature is necessary for multi-term
> synonyms to function correctly in phrase queries.
> 
> In your analysis screenshot, WDGF creates three tokens.  The two tokens
> created by splitting the input are at positions 1 and 2, which I think
> is 100% as expected.  It also sets the positionLength of the first term
> to 2, probably because it has split that term into 2 additional terms.
> 
> Then the SGF takes those last two terms and expands them.  Each of the
> synonyms is at the same position as the original term, and the relative
> positions of the two synonym pairs have not changed -- the second one is
> still one higher than the first.  I think SGF moves the
> positions two higher because the positionLength on the "b2" term is
> 2, previously set by WDGF.  Someone with more knowledge about the Graph
> implementations may have to speak up as to whether this behavior is correct.
> 
> Because the relative positions of the split terms don't change when SGF
> runs, I think this is probably working as designed.
> 
> Thanks,
> Shawn
