[jira] [Updated] (SOLR-14434) Multiterm Analyzer Not Persisted in Managed Schema

Trey Grainger (Jira) Thu, 23 Apr 2020 21:04:43 -0700


     [ https://issues.apache.org/jira/browse/SOLR-14434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Trey Grainger updated SOLR-14434:
---------------------------------
    Description: 
In addition to "{{index}}" and "{{query}}" analyzers, Solr supports adding an 
explicit "{{multiterm}}" analyzer to schema {{fieldType}} definitions. This 
gives specific control over analysis for things like wildcard terms, prefix 
queries, and range queries. For example, with the following definition, the 
wildcard query "{{hats*}}" is stemmed to "{{hat*}}" rather than left as 
"{{hats*}}", and thus matches the indexed term "{{hat}}".
{code:xml}
  <fieldType class="solr.TextField" multiValued="true" name="multiterm_test" 
positionIncrementGap="100" termOffsets="true" termVectors="true">
    <analyzer type="index">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" 
ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="multiterm">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>{code}
This works fine with a non-managed schema (i.e. a {{schema.xml}} file), or 
with a managed schema (i.e. a {{managed-schema}} file) when you push the 
schema directly to ZooKeeper. However, starting with Solr 8.0, if you use the 
Schema API to add a {{fieldType}}, the {{multiterm}} analyzer is not persisted 
(only the {{index}} and {{query}} analyzers are).
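
For reference, the Schema API request that triggers the problem can be 
sketched as below. This is a minimal illustration rather than a verified 
reproduction: the {{indexAnalyzer}}, {{queryAnalyzer}} and 
{{multiTermAnalyzer}} key names are assumptions based on the Schema API's JSON 
format and should be checked against the reference guide for your Solr 
version. The helper function names are hypothetical.

```python
import json


def _tokenizer():
    # Fresh dict per analyzer so the three definitions stay independent.
    return {"class": "solr.ClassicTokenizerFactory"}


def _lowercase_and_stem():
    return [
        {"class": "solr.LowerCaseFilterFactory"},
        {"class": "solr.EnglishMinimalStemFilterFactory"},
    ]


def build_add_field_type_command():
    """Build an add-field-type command mirroring the XML fieldType above."""
    synonyms = {
        "class": "solr.SynonymGraphFilterFactory",
        "expand": "true",
        "ignoreCase": "true",
        "synonyms": "synonyms.txt",
    }
    return {
        "add-field-type": {
            "name": "multiterm_test",
            "class": "solr.TextField",
            "multiValued": True,
            "positionIncrementGap": "100",
            "termOffsets": True,
            "termVectors": True,
            "indexAnalyzer": {
                "tokenizer": _tokenizer(),
                "filters": _lowercase_and_stem(),
            },
            "queryAnalyzer": {
                "tokenizer": _tokenizer(),
                "filters": [synonyms] + _lowercase_and_stem(),
            },
            # Assumed key name; this is the analyzer reported as not persisted.
            "multiTermAnalyzer": {
                "tokenizer": _tokenizer(),
                "filters": _lowercase_and_stem(),
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON body; sending it to Solr is left to the reader.
    print(json.dumps(build_add_field_type_command(), indent=2))
```

The printed JSON would be POSTed to the collection's {{/schema}} endpoint. 
After that, fetching the field type back (or inspecting {{managed-schema}}) 
should show whether the {{multiterm}} analyzer survived.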

This bug seems to have originated from LUCENE-8497, which refactored this code 
area substantially. The managed schema can read the {{multiterm}} analyzers in 
from the schema file but cannot write them back out. Pushing the schema 
directly to ZooKeeper only requires Solr to read it, so the bug would not have 
been obvious in initial testing. The Schema API, however, reads in the schema 
file, writes an updated schema out to ZooKeeper (where the bug occurs), and 
then reads the file back in, so all of the {{multiterm}} analyzers get 
stripped out.
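
The read/write/read cycle described above can be modelled with a toy example. 
This is not Solr's code: {{write_schema}} is a hypothetical stand-in for the 
schema writer, and its {{persisted_types}} parameter is an assumption used 
only to simulate a writer that knows about some analyzer types and not others.

```python
import xml.etree.ElementTree as ET

SCHEMA_XML = """<schema>
  <fieldType name="multiterm_test" class="solr.TextField">
    <analyzer type="index"><tokenizer class="solr.ClassicTokenizerFactory"/></analyzer>
    <analyzer type="query"><tokenizer class="solr.ClassicTokenizerFactory"/></analyzer>
    <analyzer type="multiterm"><tokenizer class="solr.ClassicTokenizerFactory"/></analyzer>
  </fieldType>
</schema>"""


def read_analyzer_types(xml_text):
    """Reading side: every analyzer type in the file is picked up."""
    root = ET.fromstring(xml_text)
    return [analyzer.get("type") for analyzer in root.iter("analyzer")]


def write_schema(xml_text, persisted_types):
    """Writing side (toy): only analyzer types in persisted_types are emitted."""
    root = ET.fromstring(xml_text)
    for field_type in list(root.iter("fieldType")):
        for analyzer in list(field_type.findall("analyzer")):
            if analyzer.get("type") not in persisted_types:
                field_type.remove(analyzer)
    return ET.tostring(root, encoding="unicode")


def schema_api_round_trip(xml_text, persisted_types):
    """Read the schema, write it out, then read the written copy back in."""
    written = write_schema(xml_text, persisted_types)
    return read_analyzer_types(written)


if __name__ == "__main__":
    # Pushing the file directly only involves the read path.
    print(read_analyzer_types(SCHEMA_XML))
    # A writer that only handles index and query analyzers loses multiterm.
    print(schema_api_round_trip(SCHEMA_XML, {"index", "query"}))
```

The first call stands in for pushing the file directly to ZooKeeper, where 
only the read path runs. The second stands in for a Schema API update, where 
the lossy write sits between two reads.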

I've identified the problematic code and am looking into an appropriate fix.

  was:
In addition to "{{index}}" and "{{query}}" analyzers, Solr supports adding an 
explicit "{{multiterm}}" analyzer to schema f\{{ieldType}} definitions. This 
allows for specific control over analysis for things like wildcard terms, 
prefix queries, range queries, etc. For example, the following would cause the 
wildcard query for "{{hats*}}" to get stemmed to "{{hat*}}" instead of 
"{{hats*}}", and thus match on the indexed version of "{{hat}}".
{code:java}
  <fieldType class="solr.TextField" multiValued="true" name="multiterm_test" 
positionIncrementGap="100" termOffsets="true" termVectors="true">
    <analyzer type="index">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" 
ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="multiterm">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
  </fieldType>{code}
This works fine if using a non-managed schema (i.e. {{schema.xml}} file) OR if 
you use managed schema (i.e. {{managed-schema}} file) and push your schema 
directly to Zookeeper. However, starting with Solr 8.0, if you use the Schema 
API to add a {{fieldType}}, the {{multiterm}} analyzers are not persisted (only 
{{index}} and {{query}} analyzers are).

This bug seems to have originated from LUCENE-8497, which refactored this code 
area substantially. The bug is caused by the managed schema being able to READ 
in the {{multiterm}} analyzers from the schema file, but then being unable to 
write them out. Since pushing the schema directly to Zookeeper only requires 
Solr reading them in, this bug would not have been obvious in initial testing. 
However, since the schema API reads in the schema file, writes an updated 
schema out to Zookeeper (where the bug occurs), and then reads the file back 
in, all of the {{multiTerm}} analyzers get stripped out.

I've identified the problematic code and am looking into an appropriate fix.


> Multiterm Analyzer Not Persisted in Managed Schema
> --------------------------------------------------
>
>                 Key: SOLR-14434
>                 URL: https://issues.apache.org/jira/browse/SOLR-14434
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>    Affects Versions: 8.0, 8.1, 8.2, 8.1.1, 8.3, 8.4, 8.3.1, 8.5, 8.4.1, 8.5.1
>            Reporter: Trey Grainger
>            Priority: Major
>


