[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006076#comment-17006076 ]

Markus Jelsma commented on LUCENE-9112:
---------------------------------------

Hello [~sarowe],

I first spotted the issue with a Dutch and an English sample using those 
ancient OpenNLP models from SourceForge.

I just trained new English and Dutch models on 250k-line CoNLL-U data sets 
and tried again to see whether the splitting behaviour persists. I had to 
adjust the test only slightly, but the splitting problem is still there, and 
in my local tests it occurs in Dutch too. At some seemingly arbitrary point 
further into the text, a 'random' term gets split.

I then tried the fresh models using OpenNLP's SentenceDetector and TokenizerME 
tools, but I cannot reproduce the problem on the command line with them.

Issue #1 is fixed by the new models, though.

I am quite new to custom Tokenizer implementations and certainly those 
extending SegmentingTokenizerBase. What do you think? 

Thanks,
Markus

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious 
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the 
> token
> # much further down the text, a seemingly unrelated token is then suddenly 
> split up; in my example (see attached unit test) the name 'Baron' is split 
> into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur with small texts in unit tests, but they 
> certainly do in real-world examples. Depending on the number of 'spurious' 
> dots, a completely different term can get split, or the same term in a 
> different location.
> I am not sure whether this is actually a problem in the Lucene code, but it 
> is a problem, and I have a Lucene unit test proving it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
