[ https://issues.apache.org/jira/browse/LUCENE-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992910#comment-16992910 ]
Jim Ferenczi commented on LUCENE-9088:
--------------------------------------

I don't think this behavior is documented. The javadocs say:

{noformat}
 * Also notice that token attributes such as
 * {@link org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute} and
 * {@link org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute} are left
 * unchanged and will inherit the values of the last token used to compose the normalized
 * number and can be wrong. Hence, for 10万 (10000), we will have
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute}
 * set to マン. This is a known issue and is subject to a future improvement.
 * <p>
{noformat}

but that doesn't explain why we use the POS of the token following a grouped number. IMO this is a bug that we should fix, so that the POS stop filter can be used to remove the punctuation that was needed to detect the numbers.

> JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute
> ----------------------------------------------------------
>
>                 Key: LUCENE-9088
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9088
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Christoph Büscher
>            Priority: Major
>
> According to the JapaneseNumberFilter javadocs, it uses the attribute values
> of the last token used to compose the normalized number, which can be wrong.
> While this is documented, it leads to a number of incompatibilities with other
> Japanese token filters.
> For example, for an input text of "2008 2009", the PartOfSpeechAttribute
> handling leads to the following output (some attributes left out...):
> ```
> {
>   "token" : "2008",
>   "start_offset" : 0,
>   "end_offset" : 4,
>   "type" : "word",
>   [...]
>   "partOfSpeech" : "記号-空白",
>   "partOfSpeech (en)" : "symbol-space"
>   [...]
> },
> {
>   "token" : " ",
>   "start_offset" : 4,
>   "end_offset" : 5,
>   "type" : "word",
>   [...]
>   "partOfSpeech" : "記号-空白",
>   "partOfSpeech (en)" : "symbol-space",
>   [...]
> },
> {
>   "token" : "2009",
>   "start_offset" : 5,
>   "end_offset" : 9,
>   "type" : "word",
>   ...
>   "partOfSpeech" : "名詞-数",
>   "partOfSpeech (en)" : "noun-numeric",
> }
> ```
> so that e.g. a following `kuromoji_part_of_speech` filter will eliminate
> the "2008" token, which is erroneously tagged as "symbol-space".
> Even without fixing the other token attributes, the POS attribute should
> IMHO be set to "noun-numeric", since that's what the filter is supposed to
> detect.
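
To make the failure mode concrete, here is a toy model of the interaction described in the issue. It does not use the Lucene API: the `Token` class, the `filter` helper and the `PosStopSketch` class name are all hypothetical stand-ins, and the English tag names are taken from the output quoted above. It only illustrates why a POS-based stop filter configured to drop "symbol-space" also drops "2008" when the number token carries the tag of the following whitespace token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Lucene classes.
class PosStopSketch {

    // Hypothetical stand-in for a token with its part-of-speech tag.
    static final class Token {
        final String term;
        final String partOfSpeech;

        Token(String term, String partOfSpeech) {
            this.term = term;
            this.partOfSpeech = partOfSpeech;
        }
    }

    // Keeps the terms whose POS tag is not in stopTags,
    // which is roughly what a POS-based stop filter does.
    static List<String> filter(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTags.contains(t.partOfSpeech)) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopTags = Set.of("symbol-space");

        // Tags as reported in the issue: "2008" carries the whitespace tag.
        List<Token> reported = List.of(
            new Token("2008", "symbol-space"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));

        // Tags as proposed in the issue: number tokens are "noun-numeric".
        List<Token> proposed = List.of(
            new Token("2008", "noun-numeric"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));

        System.out.println(filter(reported, stopTags)); // prints [2009]
        System.out.println(filter(proposed, stopTags)); // prints [2008, 2009]
    }
}
```

With the reported tags only "2009" survives; with the number tokens tagged "noun-numeric", both numbers survive and only the whitespace token is removed.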