[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014065#comment-17014065 ]
Tomoko Uchida commented on LUCENE-9123:
---------------------------------------

Hi [~h.kazuaki],
introducing the option {{discardCompoundToken}} looks fine to me. However, I think we shouldn't change the signatures of the existing constructors, for backwards compatibility (they are public interfaces, so we have to keep them during 8.x anyway). Instead, we can add a new constructor. Opinions?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with
> either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates
> multiple tokens as output. If we use `mode=normal`, it should be fine.
> However, we would like to use decomposed tokens, which maximize the chance
> of increasing recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                 synonyms="lang/synonyms_ja.txt"
>                 tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3]
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
> {noformat}
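
For illustration, the suggestion in the comment (keep the existing constructor signatures and add a new constructor for {{discardCompoundToken}}) could look roughly like the sketch below. It is only a sketch: {{SketchTokenizer}} is a hypothetical stand-in class, not Lucene's JapaneseTokenizer, its parameter lists are simplified, and the default of {{false}} for the old constructor is an assumption made here to keep the earlier behavior.

```java
// Hypothetical stand-in for a tokenizer class; this is NOT Lucene's JapaneseTokenizer.
final class SketchTokenizer {
    enum Mode { NORMAL, SEARCH, EXTENDED }

    private final boolean discardPunctuation;
    private final boolean discardCompoundToken;
    private final Mode mode;

    // Existing constructor: its signature is left unchanged so current callers still compile.
    // Assumption: passing false reproduces the earlier behavior (compound tokens are kept).
    public SketchTokenizer(boolean discardPunctuation, Mode mode) {
        this(discardPunctuation, false, mode);
    }

    // New constructor that exposes the additional option.
    public SketchTokenizer(boolean discardPunctuation, boolean discardCompoundToken, Mode mode) {
        this.discardPunctuation = discardPunctuation;
        this.discardCompoundToken = discardCompoundToken;
        this.mode = mode;
    }

    public boolean discardPunctuation() { return discardPunctuation; }
    public boolean discardCompoundToken() { return discardCompoundToken; }
    public Mode mode() { return mode; }

    public static void main(String[] args) {
        // Old call site: compiles unchanged and gets the assumed default.
        SketchTokenizer legacy = new SketchTokenizer(true, Mode.SEARCH);
        // New call site: opts in to discarding the compound token.
        SketchTokenizer updated = new SketchTokenizer(true, true, Mode.SEARCH);
        System.out.println(legacy.discardCompoundToken());
        System.out.println(updated.discardCompoundToken());
    }
}
```

Delegating from the old constructor to the new one keeps a single initialization path, so the option can be added without touching existing callers.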