[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024058#comment-17024058
 ] 

Kazuaki Hiraga commented on LUCENE-9123:
----------------------------------------

[~romseygeek] Thank you for your comments. I think we can modify the output of 
Kuromoji to deal with the issue at this moment, which the current 
GraphTokenStream cannot deal with decompounded tokens since we don't think we 
need to keep original tokens along with decompounded ones for many situations. 
So, we can introduce a new option to absorb originals for now.  However, we 
think either SynonymGraphStream or ToeknStream should be able to deal with 
complex cases like you have mentioned in the future release of Lucene. 

[~tomoko], Thank you for your hard work! Please let me know if you have 
anything what I can help your testing or updating.  And thank you for creating 
a ticket that points out the issue in SynonymGraphFilter: LUCENE-9173. 


> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Assignee: Tomoko Uchida
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> both of SynonymGraphFilter and SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize to chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" 
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                     synonyms="lang/synonyms_ja.txt"
>                     tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" 
> tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana 
> to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have 
> a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any 
> last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" 
> minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> An synonym entry that generates error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to