msfroh opened a new pull request, #14194:
URL: https://github.com/apache/lucene/pull/14194

   ### Description
   
   This lets users run either a Penn or UD part-of-speech tagging model but 
output tags in the other format. For example, a Penn POS tagging model can be 
combined with a lemmatizer model trained on UD tags.
   
   For a quick reference on the two:
   * Penn: https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html
   * UD: https://universaldependencies.org/u/pos/
   
   The conversion rules are also defined in 
https://github.com/apache/opennlp/blob/6daacd319b95c5937abca5ef99e24566825fe89f/opennlp-tools/src/main/java/opennlp/tools/postag/POSTagFormatMapper.java#L40
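
   To give a feel for what such a conversion looks like, here is a minimal 
sketch of a Penn-to-UD lookup. The class name `PennToUdSketch`, the handful of 
tags in the table, and the fallback to `X` are assumptions made for 
illustration. This is not the rule set in `POSTagFormatMapper`, which should be 
treated as the authoritative source.

```java
import java.util.Map;

/** Illustrative subset of a Penn Treebank to UD part-of-speech mapping. */
final class PennToUdSketch {
  // Hypothetical table: a few common tags only, not OpenNLP's full rule set.
  // Some entries are simplifications (e.g. Penn "IN" can also correspond to UD "SCONJ").
  private static final Map<String, String> PENN_TO_UD = Map.ofEntries(
      Map.entry("NN", "NOUN"),
      Map.entry("NNS", "NOUN"),
      Map.entry("NNP", "PROPN"),
      Map.entry("NNPS", "PROPN"),
      Map.entry("JJ", "ADJ"),
      Map.entry("RB", "ADV"),
      Map.entry("DT", "DET"),
      Map.entry("IN", "ADP"),
      Map.entry("CC", "CCONJ"),
      Map.entry("CD", "NUM"),
      Map.entry("PRP", "PRON"),
      Map.entry("UH", "INTJ"),
      Map.entry("VB", "VERB"),
      Map.entry("VBD", "VERB"),
      Map.entry("VBZ", "VERB"));

  /**
   * Returns the UD tag for a Penn tag. Tags missing from the table fall back to "X"
   * (UD's "other" category); that fallback is an assumption of this sketch.
   */
  static String toUd(String pennTag) {
    return PENN_TO_UD.getOrDefault(pennTag, "X");
  }

  public static void main(String[] args) {
    // Convert the tags a Penn model might emit for a short phrase.
    String[] pennTags = {"DT", "NN", "VBZ", "JJ"};
    for (String tag : pennTags) {
      System.out.println(tag + " -> " + toUd(tag));
    }
  }
}
```

   Note that the mapping is lossy in this direction: several Penn tags collapse 
into a single UD tag, so a round trip back to Penn cannot recover the original 
tag in general.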
   
   This commit also changes the default POSTagFormat to `CUSTOM` (I previously 
set it to `PENN`), which passes through whatever tag format the POSTaggerModel 
produces. I believe this is a reasonable default, since new users are likely to 
use only the new UD models published at 
https://opennlp.apache.org/models.html, whereas existing users likely have Penn 
models.
   
   Users only need to specify a POSTagFormat if they have a combination of 
models and need to convert between UD and Penn tag formats (to convert from a 
POSTaggerModel in one format to a lemmatizer or chunker model in the other 
format).
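
   The pass-through behavior described above can be sketched as follows. The 
names here (`TagFormatSketch`, `Format`, `output`, and the `converter` 
argument) are hypothetical and exist only for this illustration. They are not 
the Lucene or OpenNLP API, and the real filter may decide differently in edge 
cases.

```java
import java.util.function.UnaryOperator;

final class TagFormatSketch {
  /** Hypothetical stand-in for the three format choices named in this description. */
  enum Format { PENN, UD, CUSTOM }

  /**
   * Returns the tag to emit downstream. When the requested format is CUSTOM, or already
   * matches the model's own format, the model's tag is passed through unchanged.
   * Otherwise the supplied converter (an assumption of this sketch) is applied.
   */
  static String output(
      String modelTag, Format modelFormat, Format requested, UnaryOperator<String> converter) {
    if (requested == Format.CUSTOM || requested == modelFormat) {
      return modelTag;
    }
    return converter.apply(modelTag);
  }

  public static void main(String[] args) {
    // Toy converter used only for this demo: maps every tag to "NOUN".
    UnaryOperator<String> toyConverter = tag -> "NOUN";
    // Default (CUSTOM): the Penn tag reaches the next filter as-is.
    System.out.println(output("NNS", Format.PENN, Format.CUSTOM, toyConverter));
    // Explicit UD request on a Penn model: the converter runs.
    System.out.println(output("NNS", Format.PENN, Format.UD, toyConverter));
  }
}
```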
   
   Currently, the models used for the unit tests all use the Penn tag format. 
Retraining the models using the UD format can be addressed as part of 
https://github.com/apache/lucene/issues/13002 (which I may work on next). To 
verify the downstream consumption of UD tags by another filter, I manually 
updated the lemmatizer dictionary (a non-binary model) to add UD tags.
   
   Resolves https://github.com/apache/lucene/issues/14188
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

