[GitHub] [lucene] kotman12 opened a new issue, #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package

GitBox Thu, 01 Sep 2022 11:36:41 -0700


kotman12 opened a new issue, #11735:
URL: https://github.com/apache/lucene/issues/11735


   ### Description
   
   **Initial issue**: `KeywordRepeatFilter` + `OpenNLPLemmatizerFilter` leads to an 
empty token list when the token stream contains a single token.
   
   **Steps to reproduce**: run 
[TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298)
 and observe that 0 tokens are returned after processing the text “period”.
   
   **Underlying issue**: the opennlp package mishandles sentence boundary 
detection in general when `KeywordRepeatFilter` is added. The issue flies under 
the radar because the tests don’t verify which tokens are processed together as 
one sentence. The screenshot below shows that the _last_ token of the _last_ 
sentence gets dropped. This is usually not a big deal, because most of the time 
that token is punctuation, but it becomes a real problem when the last bit of 
text in a stream has no punctuation. 
   
   For example, consider the text "This is some sentence". If you pass this on 
its own into an analysis chain identical to the one configured in 
[TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298)
 you will see this:
    
   
![image](https://user-images.githubusercontent.com/13710476/187983573-99b07eae-bc73-4be5-9e56-c3fbe73525fe.png)
   
   The `OpenNLPPOSFilter` has a similar issue, although not quite as dramatic 
as the one in `OpenNLPLemmatizerFilter`. These are screenshots from a breakpoint 
in `OpenNLPLemmatizerFilter` after running the test 
[TestOpenNLPPOSFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPPOSFilterFactory.java#L150):
    
   
![image](https://user-images.githubusercontent.com/13710476/187983765-066206fc-7ab0-4248-9d76-46cc35eea6ff.png)
   
![image](https://user-images.githubusercontent.com/13710476/187983780-fcaa1de1-c250-4455-be3a-553550e4c60b.png)
    
   Notice how the single sentence “No period” gets processed as two separate 
sentences. Functionally, the result would not differ much if it were processed 
as one sentence (at least as far as the tests are concerned), but the split is 
still most likely not the desired behavior.
   
   **Suggested fix**: I am linking a [PR](https://github.com/apache/lucene/pull/11734) 
as the suggested fix for this. 
The gist is to use a one-step lookahead when processing the token stream to 
correctly detect sentence transitions in the general case of repeating tokens. I 
have centralized the inner sentence token loop, which had been repeated across 
the different sentence-aware filters. The suggested fix also removes other 
seemingly unnecessary conditional branching and tidies up the different 
OpenNLP filters so they behave more similarly to one another (at least 
wherever possible).
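
   To make the lookahead idea concrete, here is a minimal, self-contained 
sketch. It is not the code from the PR and does not use any Lucene classes. The 
`Token` class, the `groupNaive` and `groupWithLookahead` methods, and the 
assumption that a repeated token arrives with a position increment of 0 and a 
copy of the end-of-sentence flag are all hypothetical and chosen only for 
illustration. Under those assumptions it shows how closing a sentence on the 
first flagged token can split off the repeated last token, and how peeking one 
token ahead avoids that.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceLookaheadSketch {

  /** Hypothetical stand-in for a token's attributes; not a Lucene class. */
  static final class Token {
    final String term;
    final int positionIncrement; // assumption: 0 marks a repeat of the previous token
    final boolean endOfSentence; // assumption: the repeat carries the same flag

    Token(String term, int positionIncrement, boolean endOfSentence) {
      this.term = term;
      this.positionIncrement = positionIncrement;
      this.endOfSentence = endOfSentence;
    }
  }

  /** Closes a sentence as soon as any token carries the end-of-sentence flag. */
  static List<List<String>> groupNaive(List<Token> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (Token token : tokens) {
      current.add(token.term);
      if (token.endOfSentence) {
        sentences.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  /** Closes a sentence only when the next token is not a repeat of the current one. */
  static List<List<String>> groupWithLookahead(List<Token> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      Token token = tokens.get(i);
      current.add(token.term);
      // One-step lookahead: peek at the next token before deciding to close.
      boolean hasNext = i + 1 < tokens.size();
      boolean nextIsRepeat = hasNext && tokens.get(i + 1).positionIncrement == 0;
      if (token.endOfSentence && !nextIsRepeat) {
        sentences.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    // "No period" after a repeat step: every token appears twice.
    List<Token> tokens = List.of(
        new Token("No", 1, false),
        new Token("No", 0, false),
        new Token("period", 1, true),
        new Token("period", 0, true));
    System.out.println("naive:     " + groupNaive(tokens));
    System.out.println("lookahead: " + groupWithLookahead(tokens));
  }
}
```

   With the four tokens in `main`, the naive grouping yields two groups 
(`[No, No, period]` and `[period]`), while the lookahead grouping keeps all four 
tokens in one sentence. Keying the lookahead on the position increment rather 
than on the flag alone is a design choice of this sketch: it keeps two adjacent 
one-token sentences from being merged.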
   
   
   ### Version and environment details
   
   Latest version of Lucene, running on JDK 17


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
