[ https://issues.apache.org/jira/browse/LUCENE-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305660#comment-17305660 ]
Tomoko Uchida commented on LUCENE-9853:
---------------------------------------

Opened https://github.com/apache/lucene/pull/26.

Applying Unicode normalization or width normalization as pre-processing is often recommended for Japanese morphological analysis, so I think this provides a good default for many use cases.

> Use CJKWidthCharFilter as the default character normalizer for
> JapaneseAnalyzer instead of CJKWidthFilter
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9853
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9853
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>   Affects Versions: main (9.0)
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Follow-up issue of LUCENE-9413.
> We now have CJKWidthCharFilter in analyzers-common. I believe that in many
> situations it is recommended to apply half-width/full-width character
> normalization before tokenization for consistency in analysis.
> The change slightly affects the analyzer's output. We can provide a
> parameter to switch back to CJKWidthFilter for backward compatibility.
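
For readers unfamiliar with the half-width/full-width normalization mentioned above, here is a minimal sketch of what folding width variants looks like. It is not the Lucene implementation: it uses the JDK's java.text.Normalizer with NFKC as a stand-in, which applies more compatibility mappings than a width-only filter such as CJKWidthCharFilter would. The class and method names (WidthNormalizationSketch, normalizeWidth) are hypothetical and exist only for this illustration.

```java
import java.text.Normalizer;

// Illustration only: NFKC is used as a stand-in for width normalization.
// It is broader than a width-only filter and is NOT what Lucene uses internally.
class WidthNormalizationSketch {

    // Folds full-width ASCII variants to basic Latin and half-width katakana
    // to standard katakana (among other NFKC compatibility mappings).
    static String normalizeWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters spelling "Lucene" (U+FF2C U+FF55 U+FF43 U+FF45 U+FF4E U+FF45).
        String fullWidthLatin = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45";
        // Half-width katakana KA followed by the half-width voiced sound mark.
        String halfWidthKana = "\uFF76\uFF9E";

        System.out.println(normalizeWidth(fullWidthLatin));
        System.out.println(normalizeWidth(halfWidthKana));
    }
}
```

The half-width case hints at why the order matters: the two half-width code units become a single katakana character after normalization, so a tokenizer that runs before normalization sees different input than one that runs after it.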