msfroh opened a new pull request, #14194: URL: https://github.com/apache/lucene/pull/14194
### Description

This allows users to run either a Penn or UD part-of-speech tagging model but emit tags in the other format. For example, users can combine a Penn POS tagging model with a lemmatizer model trained on UD tags.

For a quick reference on the two:

* Penn: https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html
* UD: https://universaldependencies.org/u/pos/

The conversion rules are also defined in https://github.com/apache/opennlp/blob/6daacd319b95c5937abca5ef99e24566825fe89f/opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java#L40

This commit also changes the default POSTagFormat to `CUSTOM` (I had previously set it to `PENN`), which passes through whatever tag format the POSTaggerModel produces. I believe this is a reasonable default, since new users are likely to use only the new UD models published at https://opennlp.apache.org/models.html, whereas existing users likely have Penn models.

Users only need to specify a POSTagFormat if they have a mix of models and need to convert between the UD and Penn tag formats (that is, from a POSTaggerModel in one format to a lemmatizer or chunker model in the other).

Currently, the models used for the unit tests all use the Penn tag format. Retraining them with the UD format can be addressed as part of https://github.com/apache/lucene/issues/13002 (which I may work on next). To verify that another filter can consume UD tags downstream, I manually updated the lemmatizer dictionary (a non-binary model) to add UD tags.

Resolves https://github.com/apache/lucene/issues/14188
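To illustrate the kind of Penn-to-UD conversion described above, here is a minimal, self-contained sketch. It is not the OpenNLP `POSTagFormatMapper` or the Lucene filter code: the class name `PennToUdSketch`, the method `toUd`, the partial table, and the choice to fall back to the UD tag `X` for unlisted tags are all assumptions made for illustration.

```java
import java.util.Map;

public class PennToUdSketch {
    // Hypothetical, partial Penn -> UD table covering a handful of common tags.
    // The authoritative rules live in OpenNLP's POSTagFormatMapper (linked above).
    private static final Map<String, String> PENN_TO_UD = Map.ofEntries(
        Map.entry("NN", "NOUN"),
        Map.entry("NNS", "NOUN"),
        Map.entry("NNP", "PROPN"),
        Map.entry("VB", "VERB"),
        Map.entry("VBD", "VERB"),
        Map.entry("JJ", "ADJ"),
        Map.entry("RB", "ADV"),
        Map.entry("DT", "DET"),
        Map.entry("PRP", "PRON"),
        Map.entry("CD", "NUM"),
        Map.entry("UH", "INTJ"));

    // Assumption: tags missing from the table map to UD's catch-all tag "X".
    static String toUd(String pennTag) {
        return PENN_TO_UD.getOrDefault(pennTag, "X");
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "dogs", "barked"};
        String[] pennTags = {"DT", "NNS", "VBD"};
        // Print each token with its Penn tag and the converted UD tag.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "/" + pennTags[i] + " -> " + toUd(pennTags[i]));
        }
    }
}
```

Several Penn tags collapse to a single UD tag here (`NN` and `NNS` both become `NOUN`), so a mapping in the opposite direction cannot recover the original Penn tag from the UD tag alone.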