Martin Wiesner created OPENNLP-1555:
---------------------------------------
Summary: TokenizerME should detect multi-dot abbreviations
Key: OPENNLP-1555
URL: https://issues.apache.org/jira/browse/OPENNLP-1555
Project: OpenNLP
Issue Type: Improvement
Components: Tokenizer
Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.0, 2.1.0
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 2.3.4
TokenizerME should detect and handle multi-dot abbreviations correctly.
Currently, it does not. For instance,
German: "z.B." = "zum Beispiel" (for example) or,
Dutch: "e.v." = "en volgende" (and following)
are not tokenized correctly: extra tokens are returned. NOTE: there is no whitespace
between the dots in the above examples.
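
To illustrate the intended behaviour, here is a minimal, self-contained sketch in plain Java. It is NOT the TokenizerME implementation and does not use the OpenNLP API; the class name AbbreviationSketch, the hard-coded abbreviation set, and the simple whitespace-based splitting are all assumptions made for illustration only. It shows a known multi-dot abbreviation being kept as one token while an ordinary sentence-final period is split off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
class AbbreviationSketch {

    // Assumed abbreviation list, taken from the two examples in this issue.
    static final Set<String> ABBREVIATIONS = Set.of("z.B.", "e.v.");

    // Splits on whitespace, then separates a trailing period
    // unless the whole chunk is a known abbreviation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // happens only for empty or blank input
            }
            if (ABBREVIATIONS.contains(chunk)) {
                tokens.add(chunk); // keep "z.B." or "e.v." intact
            } else if (chunk.length() > 1 && chunk.endsWith(".")) {
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add("."); // sentence-final period as its own token
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Das, ist, z.B., gut, .]
        System.out.println(tokenize("Das ist z.B. gut."));
    }
}
```

A test case for the actual fix could assert the same shape of result against TokenizerME: the abbreviation appears as a single token and no extra tokens are produced for its inner dots.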
Aims:
* Fix the detection and handling of multi-dot abbreviations
* Provide test cases that cover these abbreviations
--
This message was sent by Atlassian Jira
(v8.20.10#820010)