[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018159#comment-17018159 ]
Kazuaki Hiraga commented on LUCENE-9123:
----------------------------------------

{quote}
This solution would fix Kuromoji to create a simple chain of tokens, all with position increment 1 (no overlapping compound tokens)?
{quote}
Yes. I still need to test more documents to be sure that the fix always produces a simple chain of tokens, but it seems to be working fine so far.

{quote}
Would you only use that mode when parsing the synonyms to build the synonym filter (or synonym graph filter)? (Since that seems to be where the error is occurring here). Or would you also use that as your primary Tokenizer (which would mean you don't also get compound words directly out of Kuromoji).
{quote}
In my case, I use this mode as my primary Tokenizer configuration, since I usually want decompounded tokens. It would be nice if the synonym filter and the synonym graph filter could work with this mode without the patch. However, I don't think there are many situations where we need the original tokens along with the decompounded ones (though I cannot say we will never need them).

The current workaround for this issue is to use normal mode, which does not produce decompounded tokens. But then, for example, we cannot retrieve a document that contains 株式会社 with the query 会社, because normal mode keeps 株式会社 as one token instead of decompounding it into the two tokens 株式 and 会社. (In this case we could use n-grams in addition to the tokenized field to retrieve the document, but that has other issues.)

I will try to find out whether there is a dedicated issue for the filter. If there isn't one, I will create a ticket to record the issue.
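To make the position-increment point concrete, here is a minimal Python sketch of the check that rejects the synonym entry. It is only a simplified model, not Lucene code: the `Token` class, `check_simple_chain`, `accepts`, and the three token lists are hypothetical names introduced for illustration. The search-mode layout (the compound stacked on the first part with position increment 0) is an assumption based on the error message in the issue description, which reports 株式会社 with position increment 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    term: str
    position_increment: int  # 0 means "same position as the previous token"


def check_simple_chain(term: str, tokens: List[Token]) -> None:
    """Raise ValueError unless every token advances the position by exactly 1."""
    for tok in tokens:
        if tok.position_increment != 1:
            raise ValueError(
                f"term: {term} analyzed to a token ({tok.term}) "
                f"with position increment != 1 (got: {tok.position_increment})"
            )


def accepts(term: str, tokens: List[Token]) -> bool:
    """Return True if the token stream is a simple chain, False otherwise."""
    try:
        check_simple_chain(term, tokens)
        return True
    except ValueError:
        return False


# Assumed search-mode output: the compound overlaps the decompounded parts.
SEARCH_MODE = [Token("株式", 1), Token("株式会社", 0), Token("会社", 1)]
# Normal mode keeps the compound as a single token.
NORMAL_MODE = [Token("株式会社", 1)]
# The mode discussed in this comment: decompounded parts only, no overlap.
DECOMPOUND_ONLY = [Token("株式", 1), Token("会社", 1)]
```

In this model, `NORMAL_MODE` and `DECOMPOUND_ONLY` pass the check, while `SEARCH_MODE` fails on the overlapping compound token. Only the decompound-only stream both passes the check and contains 会社 as a separate token, which is what the query 会社 needs in order to match.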

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with
> either SynonymGraphFilter or SynonymFilter when JT generates multiple
> tokens as output. If we use `mode=normal`, it should be fine. However, we
> would like to use decomposed tokens, which can maximize the chance of
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_custom_ja" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory"
>             synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <!-- Removes tokens with certain part-of-speech tags -->
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
>             tags="lang/stoptags_ja.txt" />
>     <!-- Normalizes full-width romaji to half-width and half-width kana
>          to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Removes common tokens typically not useful for search, but have
>          a negative effect on ranking -->
>     <!-- <filter class="solr.StopFilterFactory" ignoreCase="true"
>          words="lang/stopwords_ja.txt" /> -->
>     <!-- Normalizes common katakana spelling variations by removing any
>          last long sound character (U+30FC) -->
>     <filter class="solr.JapaneseKatakanaStemFilterFactory"
>             minimumLength="4"/>
>     <!-- Lower-cases romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3]
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
> {noformat}

--
This message was sent by Atlassian Jira (v8.3.4#803005)