[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

ASF GitHub Bot (Jira) Tue, 28 May 2024 04:27:34 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849994#comment-17849994
 ]


ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------

rzo1 merged PR #602:
URL: https://github.com/apache/opennlp/pull/602




> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes 
> words containing non-spacing letters. For example, the Arabic word "طُوّر" 
> is split into individual characters: ["ط", "ُ", "و", "ّ", "ر"].
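
A minimal sketch of the behaviour the report expects, assuming a
character-class tokenizer that keeps combining marks attached to the token
they follow. This is not the code merged in PR #602 and it does not use the
OpenNLP API: the class MarkAwareTokenizer and its methods are hypothetical
and rely only on java.lang.Character from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not OpenNLP's SimpleTokenizer.
final class MarkAwareTokenizer {

    // Coarse character classes used to decide where a token ends.
    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.ALPHABETIC;
        if (Character.isDigit(c)) return CharClass.NUMERIC;
        return CharClass.OTHER;
    }

    // True for Unicode general categories Mn, Me and Mc (combining marks).
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Works on UTF-16 chars; supplementary code points are not handled here.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                      // start index of the open token, -1 if none
        CharClass current = CharClass.WHITESPACE;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            // A combining mark never starts a new token while one is open.
            if (start != -1 && isCombiningMark(c)) {
                continue;
            }

            CharClass cls = classify(c);
            if (cls == CharClass.WHITESPACE) {
                if (start != -1) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start == -1) {
                start = i;
            } else if (cls != current || cls == CharClass.OTHER) {
                // Class change, or punctuation: close the open token.
                tokens.add(text.substring(start, i));
                start = i;
            }
            current = cls;
        }
        if (start != -1) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Arabic word from the report, written with Unicode escapes.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(tokenize(word).size());
    }
}
```

With this sketch, the word from the report comes back as one token, because
the two diacritics (U+064F and U+0651) have general category Mn and are
skipped as boundary candidates instead of being classified on their own.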



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
