Martin Wiesner created OPENNLP-1615:
---------------------------------------
Summary: Provide further languages for UD-based OpenNLP models
Key: OPENNLP-1615
URL: https://issues.apache.org/jira/browse/OPENNLP-1615
Project: OpenNLP
Issue Type: Improvement
Components: Models
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 2.4.1
As Universal Dependencies (https://universaldependencies.org/) offers
treebanks for many languages, we should add further basic, pre-trained
models (Sentence Detector, Tokenizer, POS Tagger).
A first investigation has shown promising results for the following languages:
* “Bulgarian|bg|BTB”
* “Czech|cs|PDT”
* “Croatian|hr|SET”
* “Danish|da|DDT”
* “Estonian|et|EDT”
* “Finnish|fi|TDT”
* “Latvian|lv|LVTB”
* “Norwegian|no|Bokmaal”
* “Polish|pl|PDB”
* “Portuguese|pt|GSD”
* “Romanian|ro|RRT”
* “Russian|ru|GSD”
* “Serbian|sr|SET”
* “Slovak|sk|SNK”
* “Slovenian|sl|SSJ”
* “Spanish|es|GSD”
* “Swedish|sv|Talbanken”
* “Ukrainian|uk|IU”
Training succeeded, and the evaluation results showed solid to excellent
performance.
The previously available languages (EN, FR, DE, NL, IT) should also be
retrained.
Aims:
* (Re-)train the three models for each language listed above using UD release 2.14
* Package and release as JAR files via Maven Central
* Optional (?): Release the model files via the classic channel (website)
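The language entries above follow a "Language|ISO code|treebank" pattern. As a rough illustration of how that list could be expanded into one training job per model type, here is a minimal Java sketch. The class name UdModelPlan, the ModelType enum and the job label format are hypothetical and chosen for this example only; they are not part of OpenNLP or of its training tooling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: expands "Language|code|treebank"
// entries into one job label per model type.
class UdModelPlan {

    /** One list entry of the form "Language|code|treebank". */
    static final class Treebank {
        final String language;
        final String code;
        final String treebank;

        Treebank(String language, String code, String treebank) {
            this.language = language;
            this.code = code;
            this.treebank = treebank;
        }
    }

    /** The three basic model types named in this issue. */
    enum ModelType { SENTENCE_DETECTOR, TOKENIZER, POS }

    /** Splits an entry on '|' and rejects anything that is not a triple. */
    static Treebank parse(String entry) {
        String[] parts = entry.trim().split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected Language|code|treebank but got: " + entry);
        }
        return new Treebank(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }

    /** Builds one label per (entry, model type) pair; the label format is an assumption. */
    static List<String> jobs(List<String> entries, String udRelease) {
        List<String> result = new ArrayList<>();
        for (String entry : entries) {
            Treebank tb = parse(entry);
            for (ModelType type : ModelType.values()) {
                result.add(String.format("%s/%s-%s/ud-%s",
                    type, tb.code, tb.treebank, udRelease));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> entries = List.of(
            "Bulgarian|bg|BTB", "Czech|cs|PDT", "Croatian|hr|SET");
        jobs(entries, "2.14").forEach(System.out::println);
    }
}
```

With all 18 new entries plus the five existing languages, such a plan would cover 23 languages and 69 models in total.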