Kazuaki Hiraga created LUCENE-9123:
--------------------------------------

             Summary: JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
                 Key: LUCENE-9123
                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 8.4
            Reporter: Kazuaki Hiraga

JapaneseTokenizer with `mode=search` or `mode=extended` works with neither SynonymGraphFilter nor SynonymFilter when the tokenizer generates multiple tokens as output. With `mode=normal` it should be fine. However, we would like to use decomposed tokens, which maximize the chance of increasing recall.

Snippet of schema:
{code:xml}
<fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="lang/synonyms_ja.txt"
            tokenizerFactory="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- Removes tokens with certain part-of-speech tags -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
    <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
    <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
    <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <!-- Lower-cases romaji characters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

A synonym entry that triggers the error:
{noformat}
株式会社,コーポレーション
{noformat}

The following is the console output:
{noformat}
$ ./bin/solr create_core -c jp_test -d ../config/solrconfs
ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3]
Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
{noformat}
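
To illustrate why the synonym entry is rejected, here is a minimal, library-free sketch of the check that appears to produce this message. It does not call Lucene. `SynonymCheckSketch`, `Token`, and `checkSynonymTerm` are hypothetical names, and both token streams are assumptions inferred from the error above: search mode is taken to emit the compound 株式会社 at the same position as its first part 株式 (position increment 0), while normal mode is taken to emit a single token.

```java
import java.util.Arrays;
import java.util.List;

public class SynonymCheckSketch {

    /** A token reduced to the two attributes relevant to this report. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Hypothetical model of the rule behind the error: every token produced
     * while analyzing one synonym term must advance the position by exactly 1.
     */
    static void checkSynonymTerm(String text, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "term: " + text + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: "
                        + t.positionIncrement + ")");
            }
        }
    }

    /** Assumed normal-mode output: one token, no decomposition. */
    static List<Token> assumedNormalMode() {
        return Arrays.asList(new Token("株式会社", 1));
    }

    /** Assumed search-mode output: the compound overlaps its first part. */
    static List<Token> assumedSearchMode() {
        return Arrays.asList(
            new Token("株式", 1),
            new Token("株式会社", 0), // same position as 株式
            new Token("会社", 1));
    }

    public static void main(String[] args) {
        // Passes: a single token with position increment 1.
        checkSynonymTerm("株式会社", assumedNormalMode());

        // Fails: the compound token arrives with position increment 0.
        try {
            checkSynonymTerm("株式会社", assumedSearchMode());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions the second call fails on the compound token with the same message as in the console output, which suggests the failure comes from the overlapping token emitted while the synonym file is being analyzed rather than from the synonym entry itself.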