schlosna commented on PR #14678:
URL: https://github.com/apache/lucene/pull/14678#issuecomment-2888508407

   > There's intentionally not a `BytesRef(String)` constructor, as there have 
been issues before.
   
   @rmuir Do you happen to recall what issues a `BytesRef(String)` constructor 
would create, and any history there? 
   
   I imagine you're referring to the case where the provided text is not 
"well-formed unicode text, with no unpaired surrogates" as defined in the 
`BytesRef(CharSequence)` constructor, and where 
`org.apache.lucene.util.UnicodeUtil#validUTF16String(java.lang.CharSequence)` 
returns false. `UnicodeUtil.UTF16toUTF8`, which `BytesRef(CharSequence)` uses, 
replaces each unpaired surrogate with U+FFFD (`�`, UTF-8 bytes `0xEF 0xBF 0xBD`), 
while the `BytesRef(String)` constructor as proposed does no replacement or 
validation.
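
To make the distinction concrete, here is a minimal JDK-only sketch of the two behaviours described above: checking for unpaired surrogates, and substituting U+FFFD before encoding. The class `SurrogateCheck` and its methods are hypothetical names for illustration; this is not Lucene's implementation of `validUTF16String` or `UTF16toUTF8`, only an approximation of what they are described as doing.

```java
import java.nio.charset.StandardCharsets;

public final class SurrogateCheck {
  private SurrogateCheck() {}

  /** Returns true if every surrogate in s is part of a high/low pair. */
  static boolean isWellFormedUtf16(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return false; // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        return false; // low surrogate with no high surrogate before it
      }
    }
    return true;
  }

  /** Encodes s as UTF-8, replacing each unpaired surrogate with U+FFFD. */
  static byte[] encodeWithReplacement(CharSequence s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)
          && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        sb.append(c).append(s.charAt(++i)); // keep the valid pair intact
      } else if (Character.isSurrogate(c)) {
        sb.append('\uFFFD'); // unpaired surrogate becomes the replacement char
      } else {
        sb.append(c);
      }
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }
}
```

With this sketch, a lone `\uD800` fails the well-formedness check and encodes to the three bytes `0xEF 0xBF 0xBD`, whereas a constructor that skips both steps would pass whatever its underlying encoder produces straight through.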
   
   
https://github.com/apache/lucene/blob/b0f992369ba967341d9b51f05f41ab25eb177e1d/lucene/core/src/java/org/apache/lucene/util/BytesRef.java#L77-L81
   
   
https://github.com/apache/lucene/blob/b0f992369ba967341d9b51f05f41ab25eb177e1d/lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java#L168-L172
   
   
   In https://github.com/apache/lucene/issues/12071 there is a mention 
(https://github.com/apache/lucene/issues/12071#issuecomment-1379313710) of using 
the vector APIs to speed up UnicodeUtil conversions. Has any of that been 
attempted in Lucene yet? I imagine a SIMD or SWAR approach like 
https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
 & 
https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
 would be worth investigating.
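
As a rough illustration of what a SWAR approach means here, the sketch below tests eight bytes at a time for the all-ASCII case by masking the high bit of each byte in a `long`. It is a hypothetical example (the class `SwarAscii` is made up for this comment), not code from Lucene or from the linked posts, and it covers only an ASCII fast path rather than full UTF-8 validation or conversion.

```java
import java.nio.ByteBuffer;

public final class SwarAscii {
  private SwarAscii() {}

  // One bit set at the top of each of the 8 bytes in a long.
  private static final long HIGH_BITS = 0x8080808080808080L;

  /** Returns true if every byte is in the ASCII range 0x00..0x7F. */
  static boolean isAscii(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int i = 0;
    // Process 8 bytes per iteration: any set high bit means non-ASCII.
    for (; i + Long.BYTES <= bytes.length; i += Long.BYTES) {
      if ((buf.getLong(i) & HIGH_BITS) != 0) {
        return false;
      }
    }
    // Handle the remaining 0..7 bytes one at a time.
    for (; i < bytes.length; i++) {
      if (bytes[i] < 0) {
        return false;
      }
    }
    return true;
  }
}
```

A conversion routine could use a check like this to take a straight copy path for ASCII input and fall back to the scalar path otherwise. Whether that pays off in practice would need benchmarking.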
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

