[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006076#comment-17006076 ]
Markus Jelsma commented on LUCENE-9112:
---------------------------------------

Hello [~sarowe],

I first spotted the issue with a Dutch and an English sample, using the ancient OpenNLP models from SourceForge. I have since trained new English and Dutch models on 250k-line CoNLL-U data sets and tried again to see whether the splitting behaviour is still there. I had to adjust the test only slightly, but the splitting problem remains, and in my local tests it persists in Dutch too. At some seemingly arbitrary point further into the text, a 'random' term gets split.

I then ran the fresh models through OpenNLP's SentenceDetector and TokenizerME command-line tools, but I cannot reproduce the problem there. Issue #1 is fixed by the new models, though.

I am quite new to custom Tokenizer implementations, and certainly to those extending SegmentingTokenizerBase. What do you think?

Thanks,
Markus

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when the text contains spurious
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token: 'sentence.' becomes the token
> # much further down the text, a seemingly unrelated token is then suddenly
> split up; in my example (see the attached unit test) the name 'Baron' is split
> into 'Baro' and 'n', and this is the real problem
> The problem never seems to occur with small texts in unit tests, but it
> certainly does in real-world examples.
> Depending on how many 'spurious' dots there are, a completely different term
> can become split, or the same term in a different location.
> I am not sure whether this is actually a problem in the Lucene code, but it is
> a problem, and I have a Lucene unit test proving it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)