[
https://issues.apache.org/jira/browse/OPENNLP-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Wiesner updated OPENNLP-1615:
------------------------------------
Summary: Train and release more languages of UD-based OpenNLP models
(was: Train and release further languages of UD-based OpenNLP models)
> Train and release more languages of UD-based OpenNLP models
> ------------------------------------------------------------
>
> Key: OPENNLP-1615
> URL: https://issues.apache.org/jira/browse/OPENNLP-1615
> Project: OpenNLP
> Issue Type: Improvement
> Components: Models
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.4.1
>
>
> Because Universal Dependencies ([https://universaldependencies.org|https://universaldependencies.org/])
> offers treebanks for many languages, we should add basic, pre-trained models
> (sentence detection, tokenization, POS tagging) for more of them.
> A first investigation has shown promising results for the following languages
> (given as language|code|treebank):
> * “Bulgarian|bg|BTB”
> * “Czech|cs|PDT”
> * “Croatian|hr|SET”
> * “Danish|da|DDT”
> * “Estonian|et|EDT”
> * “Finnish|fi|TDT”
> * “Latvian|lv|LVTB”
> * “Norwegian|no|Bokmaal”
> * “Polish|pl|PDB”
> * “Portuguese|pt|GSD”
> * “Romanian|ro|RRT”
> * “Russian|ru|GSD”
> * “Serbian|sr|SET”
> * “Slovak|sk|SNK”
> * “Slovenian|sl|SSJ”
> * “Spanish|es|GSD”
> * “Swedish|sv|Talbanken”
> * “Ukrainian|uk|IU”
> Training succeeded, and the evaluation results showed solid to excellent
> performance.
> The previously available languages (EN, FR, DE, NL, IT) should also be
> retrained.
> Aims:
> * (Re-)Train the three models for each language listed above, using UD release 2.14
> * Package and release as JAR files via Maven Central
> * Optional (?): Release the model files via the classic channel (website)
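
The language|code|treebank entries in the description can be expanded into a per-language list of models to train. The following is only a minimal sketch using plain Java, not the actual OpenNLP training setup. The class and method names are hypothetical, the treebank name assumes the customary UD_<Language>-<Treebank> form, and the model file name pattern is made up for illustration and is not the official artifact naming scheme.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: expands "Language|code|treebank" entries into a training plan.
final class UdTrainingPlan {

    /** The three model types named in the issue description. */
    enum ModelType { SENTENCE, TOKENIZER, POS }

    /** One "Language|code|treebank" entry from the list in the description. */
    record Target(String language, String code, String treebank) {

        /** Treebank name, assuming the customary "UD_Language-Treebank" form. */
        String udTreebankName() {
            return "UD_" + language + "-" + treebank;
        }

        /** Illustrative file name only; NOT the official artifact naming scheme. */
        String modelFileName(ModelType type) {
            return String.format("%s-ud-%s-%s.bin",
                    code,
                    treebank.toLowerCase(Locale.ROOT),
                    type.name().toLowerCase(Locale.ROOT));
        }
    }

    /** Parses one entry, tolerating surrounding straight or curly double quotes. */
    static Target parse(String entry) {
        // \u201C and \u201D are the curly quotes used in the issue text.
        String cleaned = entry.strip()
                .replaceAll("^[\"\u201C\u201D]+|[\"\u201C\u201D]+$", "");
        String[] parts = cleaned.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected Language|code|treebank but got: " + entry);
        }
        return new Target(parts[0].strip(), parts[1].strip(), parts[2].strip());
    }

    public static void main(String[] args) {
        // Two entries copied from the description; the rest follow the same shape.
        List<String> entries = List.of(
                "\u201CBulgarian|bg|BTB\u201D",
                "\u201CNorwegian|no|Bokmaal\u201D");
        for (String entry : entries) {
            Target target = parse(entry);
            for (ModelType type : ModelType.values()) {
                System.out.println(
                        target.udTreebankName() + " -> " + target.modelFileName(type));
            }
        }
    }
}
```

Keeping the parsed entry as a small record makes it easy to iterate over all languages and all three model types in one loop, which matches the "three models per language" aim. The actual training and packaging steps would still have to be wired in separately.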
--
This message was sent by Atlassian Jira
(v8.20.10#820010)