rmuir commented on issue #11976: URL: https://github.com/apache/lucene/issues/11976#issuecomment-1327969322
Yes, composed vs. decomposed form (NFC vs. NFD) normally does not change tokenization, so you can normalize before or after; it doesn't matter. But compatibility characters like this don't really work well in Unicode text processing: they exist only for compatibility/round-tripping with legacy encodings. You have to apply NFKC/NFKD first before you can really do anything with them. Maybe for now, normalize documents before you send them to Elasticsearch.
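A minimal Python sketch of the distinction being described (Python's `unicodedata` is used here purely for illustration; the same normalization forms are what an analyzer or a pre-indexing step would apply): canonical normalization (NFC) leaves compatibility characters untouched, while compatibility normalization (NFKC) folds them to their ordinary equivalents, which is what makes downstream tokenization behave sensibly.

```python
import unicodedata

# "ﬁ" is U+FB01 (LATIN SMALL LIGATURE FI) and "２３" are fullwidth digits:
# both are compatibility characters that only exist for round-trip fidelity.
s = "\ufb01le \uff12\uff13"

nfc = unicodedata.normalize("NFC", s)    # canonical form: ligature/fullwidth survive
nfkc = unicodedata.normalize("NFKC", s)  # compatibility form: folded to plain ASCII

print(nfc)   # ﬁle ２３  (unchanged: NFC does not touch compatibility characters)
print(nfkc)  # file 23  (ligature split, fullwidth digits folded)
```

If documents are normalized with NFKC (or NFKD) like this before they reach Elasticsearch, the tokenizer only ever sees the ordinary forms.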