cbuescher commented on a change in pull request #1073: LUCENE-9088:
JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute
URL: https://github.com/apache/lucene-solr/pull/1073#discussion_r356639454
##########
File path:
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseNumberFilter.java
##########
@@ -218,6 +228,11 @@ public final boolean incrementToken() throws IOException {
// capture the state of this token and emit it on our next incrementToken()
state = captureState();
}
+ // we restore state to when we read the last numeral token to get its attributes (e.g. part-of-speech)
+ if (lastNumeralTokenState != null) {
+ restoreState(lastNumeralTokenState);
Review comment:
Note: simply setting the PartOfSpeechAttribute to "noun-numeric" on the
emitted token wasn't as straightforward as I expected, since the implementation
wraps a whole `org.apache.lucene.analysis.ja.Token`. This is why I explored
tracking and restoring the last "good" token's state here.
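
To illustrate the idea outside of Lucene, here is a minimal, self-contained
sketch of the "remember the last good state, restore it before emitting"
pattern. It is a toy model, not the code in this PR: `StateRestoreSketch`,
`Attrs`, `emitNumber` and the "noun-suffix-counter" tag are hypothetical names
made up for the example, and copying `Attrs` only stands in for what
`captureState()` / `restoreState()` do in a real filter.

```java
import java.util.Arrays;
import java.util.List;

public class StateRestoreSketch {

    /** Hypothetical stand-in for the attributes a token stream exposes. */
    public static final class Attrs {
        public String term;
        public String partOfSpeech;

        public Attrs(String term, String partOfSpeech) {
            this.term = term;
            this.partOfSpeech = partOfSpeech;
        }

        /** Snapshot of the current values (plays the role of captureState()). */
        public Attrs copy() {
            return new Attrs(term, partOfSpeech);
        }
    }

    static boolean isNumeral(Attrs a) {
        return "noun-numeric".equals(a.partOfSpeech);
    }

    /**
     * Consumes tokens until the first non-numeral, then emits one token that
     * carries the normalized term plus the attributes of the last numeral read.
     * Returns null for an empty input.
     */
    public static Attrs emitNumber(List<Attrs> input, String normalized) {
        Attrs current = null;          // the "live" attributes
        Attrs lastNumeralState = null; // snapshot of the last numeral token

        for (Attrs token : input) {
            current = token.copy();    // reading ahead overwrites the live attributes
            if (!isNumeral(current)) {
                break;                 // lookahead has moved past the number
            }
            lastNumeralState = current.copy();
        }
        if (current == null) {
            return null;
        }
        // Without this step the emitted token would carry the part-of-speech
        // of the lookahead token instead of the numeral's.
        if (lastNumeralState != null) {
            current = lastNumeralState.copy();
        }
        current.term = normalized;
        return current;
    }

    /** Small demo input: two numerals followed by a non-numeral token. */
    public static Attrs demo() {
        List<Attrs> tokens = Arrays.asList(
            new Attrs("三", "noun-numeric"),
            new Attrs("千", "noun-numeric"),
            new Attrs("円", "noun-suffix-counter")); // hypothetical tag
        return emitNumber(tokens, "3000");
    }

    public static void main(String[] args) {
        Attrs out = demo();
        System.out.println(out.term + " / " + out.partOfSpeech);
    }
}
```

The point of the sketch is only the ordering: snapshot while the numeral's
attributes are still live, restore the snapshot, and only then overwrite the
term with the normalized number.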