rmuir commented on PR #14678:
URL: https://github.com/apache/lucene/pull/14678#issuecomment-2888555604

> In #12071 there is mention ([#12071 (comment)](https://github.com/apache/lucene/issues/12071#issuecomment-1379313710)) of using the vector APIs to speed up UnicodeUtil conversions. Has any of that been approached in Lucene yet? I imagine a SIMD or SWAR approach like https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ and https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ would be worth investigating.

If you want to speed something up, pay no mind to java.lang.String, as it isn't involved in indexing anyway. Look here, which is what the indexer calls jazillions of times: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/analysis/tokenattributes/CharTermAttributeImpl.java#L91-L95

It is just a char[] -> byte[] conversion, with the same reused char[] and byte[]; there are no allocations, strings, or any of that.
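For readers not familiar with the hot loop being pointed at: below is a minimal sketch (not Lucene's actual code) of the kind of reused char[] -> byte[] UTF-8 encoding this is about, with a simple grouped ASCII fast path standing in for the SIMD/SWAR ideas from the linked posts. The class and method names (`Utf16ToUtf8Sketch`, `encodeUTF8`) are made up for illustration, and a real investigation would presumably target the Panama Vector API (`jdk.incubator.vector`) as the #12071 comment suggests rather than this plain-Java check.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch of a char[] -> UTF-8 byte[] hot loop; not Lucene code. */
public final class Utf16ToUtf8Sketch {

  /** Encodes chars[off..off+len) into out, returning the number of bytes written. */
  public static int encodeUTF8(char[] chars, int off, int len, byte[] out) {
    int i = off, end = off + len, o = 0;

    // ASCII fast path: handle 4 chars per iteration while they all fit in 7 bits.
    while (i + 4 <= end) {
      char c0 = chars[i], c1 = chars[i + 1], c2 = chars[i + 2], c3 = chars[i + 3];
      if (((c0 | c1 | c2 | c3) & 0xFF80) != 0) {
        break; // non-ASCII in this group: fall through to the general loop
      }
      out[o] = (byte) c0;
      out[o + 1] = (byte) c1;
      out[o + 2] = (byte) c2;
      out[o + 3] = (byte) c3;
      i += 4;
      o += 4;
    }

    // General case: standard UTF-16 -> UTF-8 encoding, one code point at a time.
    while (i < end) {
      char c = chars[i++];
      if (c < 0x80) {
        out[o++] = (byte) c;
      } else if (c < 0x800) {
        out[o++] = (byte) (0xC0 | (c >> 6));
        out[o++] = (byte) (0x80 | (c & 0x3F));
      } else if (Character.isHighSurrogate(c) && i < end && Character.isLowSurrogate(chars[i])) {
        int cp = Character.toCodePoint(c, chars[i++]);
        out[o++] = (byte) (0xF0 | (cp >> 18));
        out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[o++] = (byte) (0x80 | (cp & 0x3F));
      } else if (Character.isSurrogate(c)) {
        // Unpaired surrogate: emit U+FFFD, as lenient encoders typically do.
        out[o++] = (byte) 0xEF;
        out[o++] = (byte) 0xBF;
        out[o++] = (byte) 0xBD;
      } else {
        out[o++] = (byte) (0xE0 | (c >> 12));
        out[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[o++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return o;
  }

  public static void main(String[] args) {
    char[] term = "quick brown fox".toCharArray();
    // 3 bytes per UTF-16 code unit is a safe upper bound for the output size.
    byte[] out = new byte[term.length * 3];
    int n = encodeUTF8(term, 0, term.length, out);
    System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
  }
}
```

The point of the sketch is that both the char[] and byte[] are caller-owned and reused, so any vectorized variant only has to speed up the encode loop itself; there is no String or allocation overhead to work around.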
> In #12071 these is mention [#12071 (comment)](https://github.com/apache/lucene/issues/12071#issuecomment-1379313710) of using the vector APIs to speed up UnicodeUtil conversions. Has any of that been approached in lucene yet? I imagine a SIMD or SWAR approach like https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ & https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ would be worth investigating. If you want to speed something up, pay no mind to java.lang.String as it isn't involved in the indexing anyway. Look here which is what the indexer calls jazillions of times https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/analysis/tokenattributes/CharTermAttributeImpl.java#L91-L95 it is just char[]->byte[] conversion, with the same reused char[] and byte[], there aren't allocations or strings or any of that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org