rmuir commented on PR #14678:
URL: https://github.com/apache/lucene/pull/14678#issuecomment-2888555604

> In #12071 there is mention ([#12071 (comment)](https://github.com/apache/lucene/issues/12071#issuecomment-1379313710)) of using the vector APIs to speed up UnicodeUtil conversions. Has any of that been approached in Lucene yet? I imagine a SIMD or SWAR approach like https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ and https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ would be worth investigating.

If you want to speed something up, pay no mind to java.lang.String, as it isn't involved in indexing anyway. Look here, which is what the indexer calls jazillions of times: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/analysis/tokenattributes/CharTermAttributeImpl.java#L91-L95

It is just a char[] -> byte[] conversion, with the same reused char[] and byte[]; there are no allocations, strings, or any of that.
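For readers not familiar with the hot loop being pointed at: below is a minimal sketch (not Lucene's actual code) of the kind of reused char[] -> byte[] UTF-8 encoding this is about, with a simple grouped ASCII fast path standing in for the SIMD/SWAR ideas from the linked posts. The class and method names (`Utf16ToUtf8Sketch`, `encodeUTF8`) are made up for illustration, and a real investigation would presumably target the Panama Vector API (`jdk.incubator.vector`) as the #12071 comment suggests rather than this plain-Java check.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of a char[] -> UTF-8 byte[] hot loop; not Lucene code. */
public final class Utf16ToUtf8Sketch {

  /** Encodes chars[off..off+len) into out, returning the number of bytes written. */
  public static int encodeUTF8(char[] chars, int off, int len, byte[] out) {
    int i = off, end = off + len, o = 0;

    // ASCII fast path: handle 4 chars per iteration while they all fit in 7 bits.
    while (i + 4 <= end) {
      char c0 = chars[i], c1 = chars[i + 1], c2 = chars[i + 2], c3 = chars[i + 3];
      if (((c0 | c1 | c2 | c3) & 0xFF80) != 0) {
        break; // non-ASCII in this group: fall through to the general loop
      }
      out[o] = (byte) c0;
      out[o + 1] = (byte) c1;
      out[o + 2] = (byte) c2;
      out[o + 3] = (byte) c3;
      i += 4;
      o += 4;
    }

    // General case: standard UTF-16 -> UTF-8 encoding, one code point at a time.
    while (i < end) {
      char c = chars[i++];
      if (c < 0x80) {
        out[o++] = (byte) c;
      } else if (c < 0x800) {
        out[o++] = (byte) (0xC0 | (c >> 6));
        out[o++] = (byte) (0x80 | (c & 0x3F));
      } else if (Character.isHighSurrogate(c) && i < end && Character.isLowSurrogate(chars[i])) {
        int cp = Character.toCodePoint(c, chars[i++]);
        out[o++] = (byte) (0xF0 | (cp >> 18));
        out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[o++] = (byte) (0x80 | (cp & 0x3F));
      } else if (Character.isSurrogate(c)) {
        // Unpaired surrogate: emit U+FFFD, as lenient encoders typically do.
        out[o++] = (byte) 0xEF;
        out[o++] = (byte) 0xBF;
        out[o++] = (byte) 0xBD;
      } else {
        out[o++] = (byte) (0xE0 | (c >> 12));
        out[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[o++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return o;
  }

  public static void main(String[] args) {
    char[] term = "quick brown fox".toCharArray();
    // 3 bytes per UTF-16 code unit is a safe upper bound for the output size.
    byte[] out = new byte[term.length * 3];
    int n = encodeUTF8(term, 0, term.length, out);
    System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
  }
}
```

The point of the sketch is that both the char[] and byte[] are caller-owned and reused, so any vectorized variant only has to speed up the encode loop itself; there is no String or allocation overhead to work around.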
> In #12071 these is mention [#12071 (comment)](https://github.com/apache/lucene/issues/12071#issuecomment-1379313710) of using the vector APIs to speed up UnicodeUtil conversions. Has any of that been approached in lucene yet? I imagine a SIMD or SWAR approach like https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ & https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ would be worth investigating. If you want to speed something up, pay no mind to java.lang.String as it isn't involved in the indexing anyway. Look here which is what the indexer calls jazillions of times https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/analysis/tokenattributes/CharTermAttributeImpl.java#L91-L95 it is just char[]->byte[] conversion, with the same reused char[] and byte[], there aren't allocations or strings or any of that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org