I wonder if this might be similar/related to the underlying problem that is intended to be addressed by https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the indexing analysis chain, towards the end (after all components that emit graphs). ...though that's probably *not* what's causing the problem (based on the fact that the extra FGF doesn't seem to modify any attributes).

On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <ericb...@abebooks.com> wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
> tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches
> that contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> <field name="myField" type="text_general" indexed="true" stored="true"/>
>
> And the relevant fieldType "text_general":
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Solr console Analysis and "can't" as the Field Value, the following
> tokens are produced (find the verbose output at the bottom of this email):
>
> Index
> ST    | can't
> SF    | can't
> WDGF  | cant  | can't | can   | t
> FGF   | cant  | can't | can   | t
> SGF   | cants | cant  | can't |       | cans  | can   | t
> ICUFF | cants | cant  | can't |       | cans  | can   | t
> FGF   | cants | cant  | can't |       | t
>
> Query
> ST    | can't
> SF    | can't
> WDGF  | can   | t
> SF    | can   | t
> ICUFF | can   | t
>
> As you can see after the FGF the tokens "can" and "cans" are pruned so the
> query does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation that states "If you
> use [the SynonymGraphFilter] during indexing, you must follow it with a
> Flatten Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning I tried out removing the FGF on a local
> cluster and indeed it still runs and this search now works, however I am
> paranoid that this will break far more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well - others do
> not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
> preserve their tokens "won't" does not. I believe the pattern here is that
> whenever part of the contraction has synonyms this problem manifests.
>
> Eliminating WDGF is not viable as we rely on this functionality for other uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable as in the case that we have
> the data "historical-text" we want this to match the search "history text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory to
> replace "can't" with "cant". Though this technically solves the issue, I hope
> it is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST    | text           | can't            |
>       | raw_bytes      | [63 61 6e 27 74] |
>       | start          | 0                |
>       | end            | 5                |
>       | positionLength | 1                |
>       | type           | <ALPHANUM>       |
>       | termFrequency  | 1                |
>       | position       | 1                |
> SF    | text           | can't            |
>       | raw_bytes      | [63 61 6e 27 74] |
>       | start          | 0                |
>       | end            | 5                |
>       | positionLength | 1                |
>       | type           | <ALPHANUM>       |
>       | termFrequency  | 1                |
>       | position       | 1                |
> WDGF  | text           | cant          | can't            | can        | t          |
>       | raw_bytes      | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start          | 0             | 0                | 0          | 4          |
>       | end            | 5             | 5                | 3          | 5          |
>       | positionLength | 2             | 2                | 1          | 1          |
>       | type           | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1             | 1                | 1          | 1          |
>       | position       | 1             | 1                | 1          | 2          |
>       | keyword        | false         | false            | false      | false      |
> FGF   | text           | cant          | can't            | can        | t          |
>       | raw_bytes      | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>       | start          | 0             | 0                | 0          | 4          |
>       | end            | 5             | 5                | 3          | 5          |
>       | positionLength | 2             | 2                | 1          | 1          |
>       | type           | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1             | 1                | 1          | 1          |
>       | position       | 1             | 1                | 1          | 2          |
>       | keyword        | false         | false            | false      | false      |
> SGF   | text           | cants            | cant          | can't            | cans          | can        | t          |
>       | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
>       | start          | 0                | 0             | 0                | 0             | 0          | 4          |
>       | end            | 5                | 5             | 5                | 3             | 3          | 5          |
>       | positionLength | 1                | 1             | 2                | 1             | 1          | 1          |
>       | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1                | 1             | 1                | 1             | 1          | 1          |
>       | position       | 1                | 1             | 1                | 3             | 3          | 4          |
>       | keyword        | false            | false         | false            | false         | false      | false      |
> FGF   | text           | cants            | cant          | can't            | t          |
>       | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start          | 0                | 0             | 0                | 4          |
>       | end            | 5                | 5             | 5                | 5          |
>       | positionLength | 1                | 1             | 1                | 1          |
>       | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency  | 1                | 1             | 1                | 1          |
>       | position       | 1                | 1             | 1                | 3          |
>       | keyword        | false            | false         | false            | false      |
> ICUFF | text           | cants            | cant          | can't            | t          |
>       | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>       | start          | 0                | 0             | 0                | 4          |
>       | end            | 5                | 5             | 5                | 5          |
>       | positionLength | 1                | 1             | 1                | 1          |
>       | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>       | termFrequency  | 1                | 1             | 1                | 1          |
>       | position       | 1                | 1             | 1                | 3          |
>       | keyword        | false            | false         | false            | false      |
>
> Query
>
> ST    | text           | can't            |
>       | raw_bytes      | [63 61 6e 27 74] |
>       | start          | 0                |
>       | end            | 5                |
>       | positionLength | 1                |
>       | type           | <ALPHANUM>       |
>       | termFrequency  | 1                |
>       | position       | 1                |
> SF    | text           | can't            |
>       | raw_bytes      | [63 61 6e 27 74] |
>       | start          | 0                |
>       | end            | 5                |
>       | positionLength | 1                |
>       | type           | <ALPHANUM>       |
>       | termFrequency  | 1                |
>       | position       | 1                |
> WDGF  | text           | can        | t          |
>       | raw_bytes      | [63 61 6e] | [74]       |
>       | start          | 0          | 4          |
>       | end            | 3          | 5          |
>       | positionLength | 1          | 1          |
>       | type           | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1          | 1          |
>       | position       | 1          | 2          |
>       | keyword        | false      | false      |
> SF    | text           | can        | t          |
>       | raw_bytes      | [63 61 6e] | [74]       |
>       | start          | 0          | 4          |
>       | end            | 3          | 5          |
>       | positionLength | 1          | 1          |
>       | type           | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1          | 1          |
>       | position       | 1          | 2          |
>       | keyword        | false      | false      |
> ICUFF | text           | can        | t          |
>       | raw_bytes      | [63 61 6e] | [74]       |
>       | start          | 0          | 4          |
>       | end            | 3          | 5          |
>       | positionLength | 1          | 1          |
>       | type           | <ALPHANUM> | <ALPHANUM> |
>       | termFrequency  | 1          | 1          |
>       | position       | 1          | 2          |
>       | keyword        | false      | false      |
>
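
To make the "use FlattenGraphFilter once" suggestion concrete, here is a minimal, untested sketch of the index analyzer from the quoted fieldType with the first FlattenGraphFilterFactory removed, so the single remaining FGF sits after both graph-emitting filters (WDGF and SGF). All class names and attribute values are copied from the quoted config; only the filter order differs. Two caveats: as noted above, the extra FGF is probably not what drops "can"/"cans", so this may not change the result; and as far as I know SynonymGraphFilter is not documented to handle graph input correctly, so please verify the output in the Analysis screen before relying on it.

```xml
<!-- Sketch only (not tested): same filters as the quoted index analyzer,
     but with a single FlattenGraphFilterFactory after all graph-emitting filters. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- emits a token graph (positionLength > 1 for "cant" / "can't") -->
  <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
  <!-- also emits a token graph; now receives the unflattened WDGF output -->
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- the only flatten step, placed after both graph filters -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```

The query analyzer would stay as quoted, since it contains no FlattenGraphFilter. A full reindex would be needed after any change to the index analyzer.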