[ 
https://issues.apache.org/jira/browse/LUCENE-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992910#comment-16992910
 ] 

Jim Ferenczi commented on LUCENE-9088:
--------------------------------------

I don't think this behavior is documented. The javadocs say:

{noformat}
 * Also notice that token attributes such as
 * {@link org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute},
 * {@link org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute} and
 * {@link org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute} are left
 * unchanged and will inherit the values of the last token used to compose the normalized
 * number and can be wrong. Hence, for 10万 (10000), we will have
 * {@link org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute}
 * set to マン. This is a known issue and is subject to a future improvement.
 * <p>
{noformat}

but that doesn't explain why we use the POS of the token following a grouped 
number. IMO this is a bug that we should fix in order to ensure that the POS 
stop filter can be used to remove the punctuation that was needed to detect 
the numbers.
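
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use any Lucene classes: `Token`, `tagNumbers`, `stopByPos` and `run` are hypothetical stand-ins, and the "inherit the POS of the following token" rule is only my reading of the output in the issue description, not the filter's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PosInheritanceSketch {
    // Hypothetical stand-in for a token with a term and a part-of-speech tag.
    static final class Token {
        final String term;
        final String pos;
        Token(String term, String pos) { this.term = term; this.pos = pos; }
    }

    static final String NUMERIC = "名詞-数";  // noun-numeric
    static final String SPACE = "記号-空白";  // symbol-space

    // Assumption: a number token ends up with the POS of the token that follows it.
    // With forceNumericPos = true, the POS is set to noun-numeric instead.
    static List<Token> tagNumbers(List<Token> in, boolean forceNumericPos) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            boolean isNumber = !t.term.isEmpty()
                    && t.term.chars().allMatch(Character::isDigit);
            if (isNumber && !forceNumericPos && i + 1 < in.size()) {
                out.add(new Token(t.term, in.get(i + 1).pos)); // wrong POS leaks in
            } else if (isNumber) {
                out.add(new Token(t.term, NUMERIC));
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Simplified POS stop filter: drops every token whose tag is in stopTags.
    static List<String> stopByPos(List<Token> in, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token t : in) {
            if (!stopTags.contains(t.pos)) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    // Runs the "2008 2009" input from the issue through both steps.
    static List<String> run(boolean forceNumericPos) {
        List<Token> input = List.of(
                new Token("2008", NUMERIC),
                new Token(" ", SPACE),
                new Token("2009", NUMERIC));
        return stopByPos(tagNumbers(input, forceNumericPos), Set.of(SPACE));
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // "2008" is dropped together with the space
        System.out.println(run(true));  // both numbers survive
    }
}
```

With the inherited POS, the stop filter removes "2008" along with the whitespace token. Forcing "noun-numeric" on number tokens keeps both numbers while still letting the stop filter drop the punctuation.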

 

> JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute
> ----------------------------------------------------------
>
>                 Key: LUCENE-9088
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9088
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Christoph Büscher
>            Priority: Major
>
> According to the JapaneseNumberFilter javadocs, it uses the attribute values 
> of the last token used to compose the normalized number, which can be wrong. 
> While this is documented, it leads to a number of incompatibilities with other 
> Japanese token filters.
> For example, the PartOfSpeechAttribute of the last token used for an input 
> text of "2008 2009" will lead to the following output (some attributes 
> left out...):
> ```
> {
>  "token" : "2008",
>  "start_offset" : 0,
>  "end_offset" : 4,
>  "type" : "word",
> [...]
> "partOfSpeech" : "記号-空白",
>  "partOfSpeech (en)" : "symbol-space"
> [...]
>  },
>  {
>  "token" : " ",
>  "start_offset" : 4,
>  "end_offset" : 5,
>  "type" : "word",
> [...]
> "partOfSpeech" : "記号-空白",
>  "partOfSpeech (en)" : "symbol-space",
> [...]
>  },
>  {
>  "token" : "2009",
>  "start_offset" : 5,
>  "end_offset" : 9,
>  "type" : "word",
> ...
>  "partOfSpeech" : "名詞-数",
>  "partOfSpeech (en)" : "noun-numeric",
>  }
> ```
> so that e.g. a following `kuromoji_part_of_speech` 
> filter will eliminate the "2008" token erroneously tagged as "symbol-space".
> Even without fixing the other token attributes, the POS attributes should 
> IMHO be set to "noun-numeric", since that's what the filter is supposed to 
> detect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
