Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

Michael Gibney Thu, 05 Dec 2019 10:24:05 -0800

I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?


btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).
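
For illustration, here is a minimal, untested sketch of the index-time analyzer with a single FlattenGraphFilter placed after both graph-emitting filters. The filter attributes are copied from the fieldType quoted below. The ordering is an assumption, and whether SynonymGraphFilter copes with the graph coming out of WDGF in this arrangement would need to be checked on the Analysis screen:

```xml
<!-- Sketch only: ordering is an assumption, not a verified fix. -->
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- The only FlattenGraphFilter, after all components that emit graphs -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```

The query-time analyzer stays as it is, since FlattenGraphFilter is only needed when indexing.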



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <ericb...@abebooks.com> wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
> tokens produced by WordDelimiterGraphFilter (WDGF). As a result, searches that
> contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> <field name="myField" type="text_general" indexed="true" stored="true"/>
>
> And the relevant fieldType "text_general":
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Analysis screen of the Solr console with "can't" as the Field Value,
> the following tokens are produced (find the verbose output at the bottom of
> this email):
>
> Index
> ST    | can't
> SF    | can't
> WDGF  | cant | can't | can | t
> FGF   | cant | can't | can | t
> SGF   | cants | cant | can't | | cans | can | t
> FGF   | cants | cant | can't | | t
> ICUFF | cants | cant | can't | | t
>
> Query
> ST    | can't
> SF    | can't
> WDGF  | can | t
> SF    | can | t
> ICUFF | can | t
>
> As you can see, after the second FGF the tokens "can" and "cans" are pruned, so
> the query does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation, which states "If you
> use [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
> Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning, I tried removing the FGF on a local cluster. It still
> runs and this search now works, but I am paranoid that this will break far
> more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well; others do
> not. "haven't", "shouldn't", "couldn't", "I'll", "weren't" and "ain't" all
> preserve their tokens; "won't" does not. I believe the pattern here is that
> this problem manifests whenever part of the contraction has synonyms.
>
> Eliminating WDGF is not viable as we rely on this functionality for other uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable: when the data contains
> "historical-text", we want it to match the search "history text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory to
> replace "can't" with "cant". Though this technically solves the issue, I hope
> it is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> SF    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> WDGF  | text          | cant          | can't            | can        | t          |
>       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start         | 0             | 0                | 0          | 4          |
>       | end           | 5             | 5                | 3          | 5          |
>       | positionLength| 2             | 2                | 1          | 1          |
>       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1             | 1                | 1          | 1          |
>       | position      | 1             | 1                | 1          | 2          |
>       | keyword       | false         | false            | false      | false      |
> FGF   | text          | cant          | can't            | can        | t          |
>       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start         | 0             | 0                | 0          | 4          |
>       | end           | 5             | 5                | 3          | 5          |
>       | positionLength| 2             | 2                | 1          | 1          |
>       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1             | 1                | 1          | 1          |
>       | position      | 1             | 1                | 1          | 2          |
>       | keyword       | false         | false            | false      | false      |
> SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
>       | start         | 0                | 0             | 0                | 0             | 0          | 4          |
>       | end           | 5                | 5             | 5                | 3             | 3          | 5          |
>       | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
>       | position      | 1                | 1             | 1                | 3             | 3          | 4          |
>       | keyword       | false            | false         | false            | false         | false      | false      |
> FGF   | text          | cants            | cant          | can't            | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start         | 0                | 0             | 0                | 4          |
>       | end           | 5                | 5             | 5                | 5          |
>       | positionLength| 1                | 1             | 1                | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1          |
>       | position      | 1                | 1             | 1                | 3          |
>       | keyword       | false            | false         | false            | false      |
> ICUFF | text          | cants            | cant          | can't            | t          |
>       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start         | 0                | 0             | 0                | 4          |
>       | end           | 5                | 5             | 5                | 5          |
>       | positionLength| 1                | 1             | 1                | 1          |
>       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency | 1                | 1             | 1                | 1          |
>       | position      | 1                | 1             | 1                | 3          |
>       | keyword       | false            | false         | false            | false      |
>
> Query
>
> ST    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> SF    | text          | can't            |
>       | raw_bytes     | [63 61 6e 27 74] |
>       | start         | 0                |
>       | end           | 5                |
>       | positionLength| 1                |
>       | type          | <ALPHANUM>       |
>       | termFrequency | 1                |
>       | position      | 1                |
> WDGF  | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
> SF    | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
> ICUFF | text          | can        | t          |
>       | raw_bytes     | [63 61 6e] | [74]       |
>       | start         | 0          | 4          |
>       | end           | 3          | 5          |
>       | positionLength| 1          | 1          |
>       | type          | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency | 1          | 1          |
>       | position      | 1          | 2          |
>       | keyword       | false      | false      |
>
