Markus Jelsma updated LUCENE-9112:
----------------------------------
    Attachment: LUCENE-9112-unittest.patch

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the preceding token, so 'sentence.' is emitted as the token
> # much further down the text, a seemingly unrelated token is suddenly split up; in my example the name 'Baron' is split into 'Baro' and 'n'. This is the real problem.
> The problem never seems to occur with small texts in unit tests, but it certainly does with real-world examples. Depending on how many 'spurious' dots there are, a completely different term gets split, or the same term at a different location.
> I am not sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test demonstrating it (attached).
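For reference, a reproduction test would look roughly like the sketch below. This is only an illustration of the shape of such a test, not the attached patch itself: it assumes the "opennlp" tokenizer factory from the lucene-analyzers-opennlp module, Lucene's CustomAnalyzer/BaseTokenStreamTestCase test idiom, and placeholder model file names (en-test-sent.bin, en-test-tokenizer.bin) on the test classpath; the actual failing input and expected tokens must come from a real-world document, as described above.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;

public class TestOpenNLPSpuriousPunctuation extends BaseTokenStreamTestCase {

  public void testTrailingDotsCorruptLaterToken() throws Exception {
    // Build an analyzer around the "opennlp" tokenizer factory. The model file
    // names below are placeholders for whatever sentence/tokenizer models are
    // available on the test classpath.
    Analyzer analyzer = CustomAnalyzer.builder(new ClasspathResourceLoader(getClass()))
        .withTokenizer("opennlp",
            "sentenceModel", "en-test-sent.bin",
            "tokenizerModel", "en-test-tokenizer.bin")
        .build();

    // A small input does not trigger the bug; a longer real-world text that ends a
    // sentence with "..." and mentions a name such as 'Baron' further down is needed
    // (see the attached LUCENE-9112-unittest.patch for the actual input).
    String text = "<real-world text ending a sentence with ... and containing 'Baron'>";

    // Expected: 'Baron' survives as a single token and no token keeps a trailing dot.
    // Observed with the bug: 'sentence.' is emitted and 'Baron' is split into 'Baro' + 'n'.
    assertAnalyzesTo(analyzer, text,
        new String[] { /* expected tokens for the chosen input */ });
  }
}
{code}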