rmuir commented on issue #11976: URL: https://github.com/apache/lucene/issues/11976#issuecomment-1327969322
Yes, composed vs. decomposed form (NFC vs. NFD) normally does not change tokenization, so you can normalize before or after; it doesn't matter. But compatibility characters like this don't really work well in Unicode text processing: they exist only for compatibility/round-tripping with legacy encodings. You have to apply NFKC/NFKD first before you can really do anything with them. Maybe for now, normalize documents before you send them to Elasticsearch.
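A minimal Python sketch of the distinction being described (Python's `unicodedata` is used here purely for illustration; the same normalization forms are what an analyzer or a pre-indexing step would apply): canonical normalization (NFC) leaves compatibility characters untouched, while compatibility normalization (NFKC) folds them to their ordinary equivalents, which is what makes downstream tokenization behave sensibly.

```python
import unicodedata

# "ﬁ" is U+FB01 (LATIN SMALL LIGATURE FI) and "２３" are fullwidth digits:
# both are compatibility characters that only exist for round-trip fidelity.
s = "\ufb01le \uff12\uff13"

nfc = unicodedata.normalize("NFC", s)    # canonical form: ligature/fullwidth survive
nfkc = unicodedata.normalize("NFKC", s)  # compatibility form: folded to plain ASCII

print(nfc)   # ﬁle ２３  (unchanged: NFC does not touch compatibility characters)
print(nfkc)  # file 23  (ligature split, fullwidth digits folded)
```

If documents are normalized with NFKC (or NFKD) like this before they reach Elasticsearch, the tokenizer only ever sees the ordinary forms.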