Hrayr Matevosyan created OPENNLP-1563:
-----------------------------------------
Summary: SimpleTokenizer.tokenizePos incorrectly splits words with
non-spacing letters
Key: OPENNLP-1563
URL: https://issues.apache.org/jira/browse/OPENNLP-1563
Project: OpenNLP
Issue Type: Bug
Components: Tokenizer
Affects Versions: 2.3.3
Reporter: Hrayr Matevosyan
The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes words
containing non-spacing marks (combining diacritics). For example, the Arabic
word "طُوّر" is split into its individual characters ["ط", "ُ", "و", "ّ", "ر"]
instead of being returned as a single token.
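
A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that the tokenizer decides character classes with Character.isLetter, which returns false for Unicode combining marks such as U+064F (damma) and U+0651 (shadda). The sketch below is not OpenNLP code; the class MarkAwareTokenizerSketch and its methods are hypothetical names used only to illustrate one possible approach, in which a combining mark inherits the class of the token it follows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the OpenNLP SimpleTokenizer.
public class MarkAwareTokenizerSketch {

    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    // Baseline classification; combining marks fall into OTHER here
    // because Character.isLetter is false for them.
    static CharClass classify(int cp) {
        if (Character.isWhitespace(cp)) return CharClass.WHITESPACE;
        if (Character.isLetter(cp)) return CharClass.ALPHABETIC;
        if (Character.isDigit(cp)) return CharClass.NUMERIC;
        return CharClass.OTHER;
    }

    // True for Unicode general categories Mn, Mc and Me.
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                       // start of the open token, -1 if none
        CharClass state = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // A mark that follows an open token stays attached to it.
            boolean attach = start != -1 && isCombiningMark(cp);
            CharClass cls = attach ? state : classify(cp);
            // Split on a class change; OTHER characters are emitted one by one.
            boolean boundary = cls != state || (cls == CharClass.OTHER && !attach);
            if (boundary) {
                if (start != -1) tokens.add(text.substring(start, i));
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
                state = cls;
            }
            i += Character.charCount(cp);
        }
        if (start != -1) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        // U+0637 U+064F U+0648 U+0651 U+0631: three letters, two non-spacing marks.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(Character.isLetter(0x064F)); // prints false
        System.out.println(tokenize(word).size());      // prints 1
    }
}
```

With this approach the example word stays one token, while ordinary input such as "ab 12,x" is still split into "ab", "12", "," and "x". Whether a mark with no preceding token should be treated as OTHER, as it is here, is a design choice left open.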