[
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849899#comment-17849899
]
ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------
rzo1 commented on code in PR #602:
URL: https://github.com/apache/opennlp/pull/602#discussion_r1616664137
##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java:
##########
@@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() {
     Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n",
         "b", "\r", "\n", "\r", "\n", "c"},
         tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c"));
   }
+
+  /**
+   * Tests if it can tokenize a word containing a non-spacing character
+   * like Arabic Damma Unicode Character “◌ُ” (U+064F)
+   */
+  @Test
+  void testNonSpacingLetters() {
+    String text = "طُوّر";
Review Comment:
@demq Can we have a full sentence example here?
> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
> Key: OPENNLP-1563
> URL: https://issues.apache.org/jira/browse/OPENNLP-1563
> Project: OpenNLP
> Issue Type: Bug
> Components: Tokenizer
> Affects Versions: 2.3.3
> Reporter: Hrayr Matevosyan
> Priority: Major
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes
> words containing non-spacing letters. For example, the Arabic word "طُوّر"
> gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"].
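
For illustration, the behaviour described in the issue can be reproduced and avoided with a small stand-alone sketch. This is not the OpenNLP implementation or the change proposed in PR #602; the class NonSpacingSketch, the enum Kind and the methods kindOf and tokenize are hypothetical names. The sketch assumes a tokenizer that splits on changes of character class, and that a non-spacing mark (Unicode general category Mn, such as U+064F or U+0651) should inherit the class of the character it follows instead of starting a new token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch, not OpenNLP code.
final class NonSpacingSketch {

    // Coarse character classes used to decide where a token ends.
    enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    // Classifies a code point. A non-spacing mark (category Mn) takes the
    // class of the preceding code point, so it stays attached to its base.
    static Kind kindOf(int codePoint, Kind previous) {
        if (Character.getType(codePoint) == Character.NON_SPACING_MARK
                && previous != null && previous != Kind.WHITESPACE) {
            return previous;
        }
        if (Character.isWhitespace(codePoint)) return Kind.WHITESPACE;
        if (Character.isLetter(codePoint)) return Kind.LETTER;
        if (Character.isDigit(codePoint)) return Kind.DIGIT;
        return Kind.OTHER;
    }

    // Splits text into maximal runs of one class and drops whitespace runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;      // start index of the current run
        Kind state = null;   // class of the previous code point
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Kind kind = kindOf(cp, state);
            if (kind != state) {
                if (start >= 0 && state != Kind.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
            state = kind;
            i += Character.charCount(cp);
        }
        if (start >= 0 && state != Kind.WHITESPACE) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word from the issue, written with escapes to avoid encoding problems.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(tokenize(word).size());
    }
}
```

Without the first branch in kindOf, U+064F and U+0651 would be classified as OTHER, because Character.isLetter returns false for category Mn, and the word would be cut at every mark. Unlike this sketch, a real tokenizer may also want to split runs of punctuation into separate tokens.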