Kazuaki Hiraga created LUCENE-9123:
--------------------------------------

             Summary: JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
                 Key: LUCENE-9123
                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 8.4
            Reporter: Kazuaki Hiraga


JapaneseTokenizer with `mode=search` or `mode=extended` does not work with
either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates
multiple tokens as output. With `mode=normal`, it should be fine. However, we
would like to use the decomposed tokens, since they maximize the chance of
increasing recall.

Snippet of schema:
{code:xml}
    <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.SynonymGraphFilterFactory"
                    synonyms="lang/synonyms_ja.txt"
                    tokenizerFactory="solr.JapaneseTokenizerFactory"/>

        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <!-- Removes tokens with certain part-of-speech tags -->
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
        <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
        <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
        <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <!-- Lower-cases romaji characters -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
{code}

A synonym entry that triggers the error:
{noformat}
株式会社,コーポレーション
{noformat}

The following is the console output:
{noformat}
$ ./bin/solr create_core -c jp_test -d ../config/solrconfs

ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
{noformat}
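
To make the error message easier to read, here is a minimal Python sketch of the kind of check that appears to reject the synonym entry. It is a model only, not Lucene code: the names `Token`, `check_synonym_term`, and `try_check` are hypothetical, and the search-mode token sequence (株式, 株式会社, 会社) is an assumption made for illustration. The only detail taken from the console output is that the compound token 株式会社 arrives with position increment 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    term: str
    position_increment: int


def check_synonym_term(text: str, tokens: List[Token]) -> List[str]:
    """Accept only token streams in which every token advances by one position."""
    terms = []
    for tok in tokens:
        if tok.position_increment != 1:
            # Message shaped after the console output in this report.
            raise ValueError(
                f"term: {text} analyzed to a token ({tok.term}) "
                f"with position increment != 1 (got: {tok.position_increment})"
            )
        terms.append(tok.term)
    return terms


# Assumed token streams, for illustration only.
NORMAL_MODE = [Token("株式会社", 1)]
SEARCH_MODE = [Token("株式", 1), Token("株式会社", 0), Token("会社", 1)]


def try_check(text: str, tokens: List[Token]):
    """Return the accepted terms, or the error message if the stream is rejected."""
    try:
        return check_synonym_term(text, tokens)
    except ValueError as err:
        return str(err)
```

Under these assumptions, `try_check("株式会社", NORMAL_MODE)` returns the single term, while the search-mode stream is rejected at the compound token, which matches the observation that `mode=normal` should be fine.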


