Hello,

We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
WordDelimiterFilter have been deprecated. The Solr documentation
recommends using SynonymGraphFilter and WordDelimiterGraphFilter
instead. In our current schema, we have a text field type defined as

<fieldType name="text_syn" class="solr.TextField" omitPositions="true"
positionIncrementGap="100" autoGeneratePhraseQueries="true">

      <analyzer type="index">

        <tokenizer class="solr.WhitespaceTokenizerFactory"/>

        <filter class="solr.LowerCaseFilterFactory"/>


        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>

        <filter class="solr.WordDelimiterFilterFactory"
splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="0" preserveOriginal="1"/>

        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

      </analyzer>

      <analyzer type="query">

        <tokenizer class="solr.WhitespaceTokenizerFactory"/>

        <filter class="solr.LowerCaseFilterFactory"/>


        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>

        <filter class="solr.WordDelimiterFilterFactory"
splitOnNumerics="0" generateWordParts="0" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" catenateAll="1"
splitOnCaseChange="0" preserveOriginal="1"/>

        <filter class="solr.LowerCaseFilterFactory"/>

        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

      </analyzer>

</fieldType>

In the index phase we have both SynonymFilter and WordDelimiterFilter
configured:

        <filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>

        <filter class="solr.WordDelimiterFilterFactory"
splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="0" preserveOriginal="1"/>

The Solr documentation states that these graph filters produce correct
token graphs but cannot consume an input token graph correctly, and
that when either of them is used during indexing it must be followed by
a FlattenGraphFilter. I am confused about how to replace the old filters
with the new SynonymGraphFilter and WordDelimiterGraphFilter. A few
questions:

1. Regarding the FlattenGraphFilter, should it be used only once, or
after each graph filter? Can we configure the index chain like this?

       <filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

       <filter class="solr.FlattenGraphFilterFactory"/>

       <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>

        <filter class="solr.WordDelimiterGraphFilterFactory"
splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="0" preserveOriginal="1"/>

       <filter class="solr.FlattenGraphFilterFactory"/>

2. Is it possible to have two graph filters, i.e. both
SynonymGraphFilter and WordDelimiterGraphFilter, in the same analysis
chain? If not, what is the best option to replace our current config?

3. With the StopFilterFactory in between SynonymGraphFilter and
WordDelimiterGraphFilter, I get a few indexing errors:

Exception writing document id XXXXXX to the index; possible analysis error

Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

But if I move StopFilter before the SynonymGraphFilter the errors are gone.
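
For reference, the reordered index chain that runs without the
exception looks roughly like this. It is only a sketch of the filter
order: the attributes are copied from the config above, and keeping a
FlattenGraphFilter after each graph filter is carried over from the
proposal in question 1, not something I have confirmed to be correct.

        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>

        <filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.FlattenGraphFilterFactory"/>

        <filter class="solr.WordDelimiterGraphFilterFactory"
splitOnNumerics="0" generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="0" preserveOriginal="1"/>

        <filter class="solr.FlattenGraphFilterFactory"/>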

I guess the StopFilter messes up the SynonymGraphFilter output? I am
not sure whether this is a Solr defect or there is a guideline that
StopFilter should not be placed after graph filters.

Thanks in advance for your input.


Thanks,

Wei
