[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008871#comment-17008871 ]
Markus Jelsma commented on LUCENE-9112:
---------------------------------------

SegmentingTokenizerBase works fine on texts smaller than 1024 characters. Any term that occupies the 1024th position is split due to this bug. Ideally, the class should refill the buffer and move on after each full sentence it consumes; sentences over 1024 characters are rare. But judging from the println output I see, it does not do that, or does it incorrectly.

I am going to work around the problem for now by splitting my text into paragraphs using newlines. However, paragraphs larger than 1024 characters will still be a problem. I have checked my text sources for paragraph length and they usually do not exceed it, but paragraphs longer than 1024 characters are common enough, so I'll attach the simplest patch that 'fixes' that part for my case.

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch, en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the token
> # much further down the text, a seemingly unrelated token is then suddenly split up; in my example (see attached unit test) the name 'Baron' is split into 'Baro' and 'n', and this is the real problem
> The problems never seem to occur with small texts in unit tests, but they certainly do in real-world examples. Depending on how many 'spurious' dots there are, a completely different term can become split, or the same term in a different location.
> I am not too sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test demonstrating it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
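The workaround described in the comment (pre-splitting input into newline-delimited paragraphs so each chunk stays under SegmentingTokenizerBase's 1024-character buffer) can be sketched in plain Java. This is a minimal, stdlib-only illustration; the class name `ParagraphSplitter` and the constant are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workaround from the comment above:
// feed the analyzer one newline-delimited paragraph at a time so
// each chunk stays below SegmentingTokenizerBase's ~1024-char buffer.
public class ParagraphSplitter {

    // Assumed buffer size, matching the 1024 mentioned in the comment.
    static final int BUFFER_SIZE = 1024;

    public static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        // One or more consecutive newlines delimit a paragraph.
        for (String p : text.split("\n+")) {
            String trimmed = p.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.length() > BUFFER_SIZE) {
                // As the comment notes, paragraphs longer than the buffer
                // remain a problem; flag them instead of splitting terms.
                System.err.println("warning: paragraph exceeds buffer size: "
                        + trimmed.length() + " chars");
            }
            paragraphs.add(trimmed);
        }
        return paragraphs;
    }
}
```

Each returned paragraph would then be analyzed separately, so no single buffer fill spans a term boundary at position 1024 (except within over-long paragraphs, which the attached patch addresses).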