[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006105#comment-17006105 ]
Markus Jelsma commented on LUCENE-9112:
---------------------------------------

There it is:
{code}
usableLength = findSafeEnd();
if (usableLength < 0) usableLength = length; /*
                                              * more than IOBUFFER of text without breaks,
                                              * gonna possibly truncate tokens
                                              */
{code}
The text I send to be analyzed no longer has newlines, or any other character that findSafeEnd() looks for.

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch,
> en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious
> punctuation such as having triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the
> token
> # much further down the text, a seemingly unrelated token is then suddenly
> split up; in my example (see attached unit test) the name 'Baron' is split
> into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur when using small texts in unit tests, but
> they certainly do in real-world examples. Depending on how many 'spurious'
> dots there are, a completely different term can become split, or the same
> term in just a different location.
> I am not too sure if this is actually a problem in the Lucene code, but it is
> a problem and I have a Lucene unit test proving the problem.
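
The fallback quoted in the comment above (use the whole buffer when no safe end is found) can be illustrated with a small standalone sketch. This is not the Lucene code: the class BufferTruncationSketch, its methods, the buffer size of 12, and the whitespace tokenization that stands in for the OpenNLP model are all assumptions made for illustration. Only the names findSafeEnd, usableLength and length mirror the quoted snippet, and treating line breaks as the only safe ends is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferTruncationSketch {

    // Hypothetical safe-end check: in this sketch only line breaks count.
    static boolean isSafeEnd(char c) {
        return c == '\n' || c == '\r';
    }

    // Index just past the last safe-end character in buf[0..length), or -1 if none.
    static int findSafeEnd(char[] buf, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buf[i])) {
                return i + 1;
            }
        }
        return -1;
    }

    // Cuts text into chunks of at most bufferSize chars, preferring a safe end.
    static List<String> chunks(String text, int bufferSize) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int length = Math.min(bufferSize, text.length() - pos);
            char[] buf = text.substring(pos, pos + length).toCharArray();
            int usableLength = length;
            if (pos + length < text.length()) { // more input follows: look for a safe cut
                usableLength = findSafeEnd(buf, length);
                if (usableLength < 0) {
                    usableLength = length; // no break found: the cut may land inside a token
                }
            }
            out.add(new String(buf, 0, usableLength));
            pos += usableLength;
        }
        return out;
    }

    // Tokenizes each chunk on its own, splitting on whitespace.
    static List<String> tokens(String text, int bufferSize) {
        List<String> out = new ArrayList<>();
        for (String chunk : chunks(text, bufferSize)) {
            for (String t : chunk.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // No line break within the first 12 chars: the buffer boundary splits 'Baron'.
        System.out.println(tokens("the old Baron spoke", 12));  // [the, old, Baro, n, spoke]
        // A line break gives the reader a safe place to cut, so 'Baron' survives.
        System.out.println(tokens("the old\nBaron spoke", 12)); // [the, old, Baron, spoke]
    }
}
```

In this sketch the split depends only on where the buffer boundary happens to fall, which would be consistent with the report that a different term, or the same term in a different location, is split depending on how much text precedes it.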