kotman12 opened a new issue, #11735: URL: https://github.com/apache/lucene/issues/11735
### Description

**Initial issue**: `KeywordRepeatFilter` + `OpenNLPLemmatizerFilter` leads to an empty token list when the stream contains a single token.

**Steps to reproduce**: run [TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298) and observe that 0 tokens are returned after processing the text “period”.

**Underlying issue**: the opennlp package mishandles sentence boundary detection in general when `KeywordRepeatFilter` is added. The issue flies under the radar because the tests don’t verify which tokens are processed together as one sentence. Below is a screenshot showing that the _last_ token of the _last_ sentence gets dropped. This is usually not a big deal because that token is punctuation most of the time, but it becomes especially problematic when the last bit of text in a stream has no punctuation. For example, consider the text "This is some sentence". If you pass this on its own into an analysis chain identical to the one configured in [TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298) you will see this:

[screenshot]

The `OpenNLPPOSFilter` has a similar issue, although not quite as dramatic as in `OpenNLPLemmatizerFilter`. This is a screenshot from a breakpoint in `OpenNLPLemmatizerFilter` after running the test [TestOpenNLPPOSFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPPOSFilterFactory.java#L150):

[screenshot]

Notice how the one sentence “No period” gets processed as two separate sentences.
Functionally, the result would not differ much if it were processed as one sentence (at least as far as the tests are concerned), but splitting it is still most likely not the desired behavior.

**Suggested fix**: Linking a [PR](https://github.com/apache/lucene/pull/11734) as the suggested fix for this. The gist is to use a one-step lookahead when processing the token stream, so that sentence transitions are detected correctly in the general case of repeated tokens. I have centralized the inner sentence token loop, which had been repeated across the different sentence-aware filters. The suggested fix also removes other seemingly unnecessary conditional branching and tidies up the different OpenNLP filters so that they behave more similarly to one another (at least wherever possible).

### Version and environment details

Latest version of Lucene, running on JDK 17.
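
The one-step lookahead mentioned under **Suggested fix** can be illustrated with a minimal, self-contained sketch. This is not the code from the linked PR and it does not use Lucene classes: `Token`, `groupNaive`, `groupWithLookahead` and `noPeriod` are hypothetical names, and the model assumes that a repeated token arrives with a position increment of 0 and carries the same end-of-sentence marker as the token it repeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LookaheadSentenceGrouping {

    /** Hypothetical stand-in for a token's attributes; not a Lucene class. */
    static final class Token {
        final String term;
        final int positionIncrement; // assumed to be 0 for a repeated (stacked) token
        final boolean endOfSentence; // assumed to be copied onto the repeated token

        Token(String term, int positionIncrement, boolean endOfSentence) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.endOfSentence = endOfSentence;
        }
    }

    /** Closes the sentence as soon as any end-of-sentence token is seen. */
    static List<List<String>> groupNaive(List<Token> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Token t : tokens) {
            current.add(t.term);
            if (t.endOfSentence) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    /** Reads one token ahead and only closes the sentence once no repeat follows. */
    static List<List<String>> groupWithLookahead(Iterator<Token> stream) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Token token = stream.hasNext() ? stream.next() : null;
        while (token != null) {
            Token next = stream.hasNext() ? stream.next() : null; // one-step lookahead
            current.add(token.term);
            boolean nextIsRepeat = next != null && next.positionIncrement == 0;
            // Close at end of stream, or at an end-of-sentence token with no repeat after it.
            if (next == null || (token.endOfSentence && !nextIsRepeat)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
            token = next;
        }
        return sentences;
    }

    /** "No period" with every token repeated once, as assumed in this sketch. */
    static List<Token> noPeriod() {
        return Arrays.asList(
            new Token("No", 1, false),
            new Token("No", 0, false),
            new Token("period", 1, true),
            new Token("period", 0, true));
    }

    public static void main(String[] args) {
        System.out.println(groupNaive(noPeriod()));                    // [[No, No, period], [period]]
        System.out.println(groupWithLookahead(noPeriod().iterator())); // [[No, No, period, period]]
    }
}
```

Under these assumptions the naive loop splits “No period” into two sentences, while the lookahead version keeps the repeated final token in the same sentence as the token it repeats.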