[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018748#comment-17018748 ]
Kazuaki Hiraga commented on LUCENE-9123:
----------------------------------------

{quote}
Also we need to add some tests to TestJapaneseTokenizer and TestJapaneseTokenizerFactory.
And according to the custom, the final patch to the master branch should be named "LUCENE-9123.patch" so can you please overwrite the obsolete patch instead of upload new ones?
{quote}
[~tomoko], thank you very much! I will prepare an updated patch, including unit tests, for both 8x and master in a few days.

{quote}
I'm not sure if there is explicit maintainer on each Lucene module, theoretically every person who has write access to the ASF repo can commit any patches on his own responsibility. Let us wait for a few days and I will commit the patch if there are no other comments or objections.
{quote}
OK, understood! Again, thank you for your warm feedback!

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>   Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Assignee: Tomoko Uchida
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with
> either SynonymGraphFilter or SynonymFilter when JT generates multiple
> tokens as output. With `mode=normal`, it should be fine. However, we
> would like to use the decomposed tokens, which maximize the chance of
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_custom_ja" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory"
>             synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <!-- Removes tokens with certain part-of-speech tags -->
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
>             tags="lang/stoptags_ja.txt" />
>     <!-- Normalizes full-width romaji to half-width and half-width kana
>          to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Removes common tokens typically not useful for search, but have
>          a negative effect on ranking -->
>     <!-- <filter class="solr.StopFilterFactory" ignoreCase="true"
>          words="lang/stopwords_ja.txt" /> -->
>     <!-- Normalizes common katakana spelling variations by removing any
>          last long sound character (U+30FC) -->
>     <filter class="solr.JapaneseKatakanaStemFilterFactory"
>             minimumLength="4"/>
>     <!-- Lower-cases romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that triggers the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3]
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
> {noformat}
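
To illustrate the error reported above: the message suggests that the synonym parser rejects any analyzed token whose position increment is not 1. The following is a minimal, self-contained Java sketch of that kind of check, not Lucene's actual code. The class `SynonymPositionCheck`, its methods, and both token lists are hypothetical. In particular, the search-mode decomposition of 株式会社 into 株式 / 株式会社 / 会社 is an assumption, chosen only so that the compound token carries increment 0 as in the reported message.

```java
import java.util.List;

// Simplified model of the failing check; these are NOT Lucene classes.
final class SynonymPositionCheck {

    /** A token reduced to the two attributes that matter for this issue. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Rejects any token whose position increment is not exactly 1. */
    static void requireLinearChain(String entry, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "term: " + entry + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: "
                        + t.positionIncrement + ")");
            }
        }
    }

    /** Assumed output for mode=normal: one token, increment 1. */
    static List<Token> normalModeTokens() {
        return List.of(new Token("株式会社", 1));
    }

    /** Assumed output for mode=search: the compound is stacked with increment 0. */
    static List<Token> searchModeTokens() {
        return List.of(
            new Token("株式", 1),
            new Token("株式会社", 0),
            new Token("会社", 1));
    }

    public static void main(String[] args) {
        requireLinearChain("株式会社", normalModeTokens());
        System.out.println("normal mode: accepted");
        try {
            requireLinearChain("株式会社", searchModeTokens());
        } catch (IllegalArgumentException e) {
            System.out.println("search mode: " + e.getMessage());
        }
    }
}
```

Under these assumptions, the normal-mode entry passes while the search-mode entry is rejected at the stacked compound token, which matches the shape of the reported failure.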