Martin Wiesner created OPENNLP-1615:
---------------------------------------

             Summary: Provide further languages for UD-based OpenNLP models 
                 Key: OPENNLP-1615
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1615
             Project: OpenNLP
          Issue Type: Improvement
          Components: Models
            Reporter: Martin Wiesner
            Assignee: Martin Wiesner
             Fix For: 2.4.1


As Universal Dependencies (https://universaldependencies.org/) offers
treebanks for many languages, we should add further basic, pre-trained
models (sentence detector, tokenizer, POS tagger).

A first investigation has shown promising results for the following languages:

* “Bulgarian|bg|BTB”
* “Czech|cs|PDT”
* “Croatian|hr|SET”
* “Danish|da|DDT”
* “Estonian|et|EDT”
* “Finnish|fi|TDT”
* “Latvian|lv|LVTB”
* “Norwegian|no|Bokmaal”
* “Polish|pl|PDB”
* “Portuguese|pt|GSD”
* “Romanian|ro|RRT”
* “Russian|ru|GSD”
* “Serbian|sr|SET”
* “Slovak|sk|SNK”
* “Slovenian|sl|SSJ”
* “Spanish|es|GSD”
* “Swedish|sv|Talbanken”
* “Ukrainian|uk|IU”
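
Each entry above appears to follow the pattern "language|language code|treebank".
As a rough illustration only, the sketch below parses such an entry and derives
the names Universal Dependencies conventionally uses for the treebank repository
and its training file. The class UdTreebanks, the record Treebank and their
methods are hypothetical names chosen for this example; they are not part of
OpenNLP or of the training setup used for this issue.

```java
import java.util.List;
import java.util.Locale;

public class UdTreebanks {

    /** One list entry: language name, language code, UD treebank name (assumed field order). */
    record Treebank(String language, String code, String name) {

        /** UD repository naming convention, e.g. "UD_Bulgarian-BTB". */
        String repository() {
            return "UD_" + language + "-" + name;
        }

        /** UD training file naming convention, e.g. "bg_btb-ud-train.conllu". */
        String trainFile() {
            return code + "_" + name.toLowerCase(Locale.ROOT) + "-ud-train.conllu";
        }
    }

    /** Parses an entry such as "Bulgarian|bg|BTB"; surrounding quotes are ignored. */
    static Treebank parse(String entry) {
        // Drop typographic or plain double quotes, then split on the literal pipe.
        String cleaned = entry.replaceAll("[\u201C\u201D\"]", "").strip();
        String[] parts = cleaned.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected language|code|treebank: " + entry);
        }
        return new Treebank(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        List<String> entries = List.of("Bulgarian|bg|BTB", "Norwegian|no|Bokmaal");
        for (String entry : entries) {
            Treebank t = parse(entry);
            System.out.println(t.repository() + " -> " + t.trainFile());
        }
    }
}
```

For the entry "Bulgarian|bg|BTB" this yields the repository name UD_Bulgarian-BTB
and the training file bg_btb-ud-train.conllu, which could serve as input for a
per-language training script.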

Training succeeded, and the evaluation results showed solid to excellent
performance.
The previously available languages (EN, FR, DE, NL, IT) should also be
retrained.

Aims:
* (Re-)train the three models for each language listed above using UD release 2.14
* Package and release the models as JAR files via Maven Central
* Optional (?): Release the model files via the classic channel (website)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
