[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Zowalla resolved OPENNLP-1563.
--------------------------------------
Fix Version/s: 2.3.4
Resolution: Fixed
> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
> Key: OPENNLP-1563
> URL: https://issues.apache.org/jira/browse/OPENNLP-1563
> Project: OpenNLP
> Issue Type: Bug
> Components: Tokenizer
> Affects Versions: 2.3.3
> Reporter: Hrayr Matevosyan
> Priority: Major
> Fix For: 2.3.4
>
>
> The tokenizePos implementation of SimpleTokenizer incorrectly splits
> words containing non-spacing marks (combining diacritics). For example, the Arabic word "طُوّر"
> is tokenized into individual characters ["ط", "ُ", "و", "ّ", "ر"] instead of a single token.
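
The behaviour described above can be reproduced without OpenNLP. The following is a minimal, self-contained sketch and not the actual SimpleTokenizer code or the fix shipped in 2.3.4. The class name, the CharClass enum, and the markAware flag are hypothetical and exist only for illustration. It assumes a tokenizer that starts a new token whenever the character class changes. The diacritics in the example word (U+064F and U+0651) have Unicode category Mn (non-spacing mark), so Character.isLetter returns false for them, and a naive classifier ends the token at each mark. Letting a non-spacing mark inherit the class of the character it follows keeps the word together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the OpenNLP implementation.
class MarkAwareTokenizerSketch {

    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    // Classify a code point. When markAware is true, a non-spacing mark (Mn)
    // inherits the class of the preceding non-whitespace character.
    static CharClass classify(int codePoint, CharClass previous, boolean markAware) {
        if (Character.isWhitespace(codePoint)) {
            return CharClass.WHITESPACE;
        }
        if (Character.isLetter(codePoint)) {
            return CharClass.ALPHABETIC;
        }
        if (Character.isDigit(codePoint)) {
            return CharClass.NUMERIC;
        }
        if (markAware
                && Character.getType(codePoint) == Character.NON_SPACING_MARK
                && previous != CharClass.WHITESPACE) {
            return previous;
        }
        return CharClass.OTHER;
    }

    // Split text into tokens at every change of character class, dropping whitespace.
    static List<String> tokenize(String text, boolean markAware) {
        List<String> tokens = new ArrayList<>();
        CharClass state = CharClass.WHITESPACE;
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            CharClass cls = classify(cp, state, markAware);
            if (cls != state) {
                if (state != CharClass.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
            state = cls;
            i += Character.charCount(cp);
        }
        if (state != CharClass.WHITESPACE) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Arabic word from the report, written with escapes to avoid encoding issues.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println("naive token count:      " + tokenize(word, false).size());
        System.out.println("mark-aware token count: " + tokenize(word, true).size());
    }
}
```

With markAware set to false the sketch splits the word at every diacritic, as in the report. With markAware set to true the word stays a single token.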