schlosna commented on PR #14678: URL: https://github.com/apache/lucene/pull/14678#issuecomment-2888508407
> There's intentionally not a `BytesRef(String)` constructor, as there have been issues before.

@rmuir Do you happen to recall what issues a `BytesRef(String)` constructor would create, and any history there?

I imagine you're referring to the case where the provided text is not "well-formed unicode text, with no unpaired surrogates" as required by the `BytesRef(CharSequence)` constructor, that is, where `org.apache.lucene.util.UnicodeUtil#validUTF16String(java.lang.CharSequence)` returns false. `UnicodeUtil.UTF16toUTF8`, used by `BytesRef(CharSequence)`, replaces each unpaired surrogate with `�` (U+FFFD, bytes `0xEF 0xBF 0xBD`), while the `BytesRef(String)` constructor as proposed does no replacement or validation.

https://github.com/apache/lucene/blob/b0f992369ba967341d9b51f05f41ab25eb177e1d/lucene/core/src/java/org/apache/lucene/util/BytesRef.java#L77-L81
https://github.com/apache/lucene/blob/b0f992369ba967341d9b51f05f41ab25eb177e1d/lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java#L168-L172

In https://github.com/apache/lucene/issues/12071 there is a mention (https://github.com/apache/lucene/issues/12071#issuecomment-1379313710) of using the vector APIs to speed up `UnicodeUtil` conversions. Has any of that been attempted in Lucene yet? I imagine a SIMD or SWAR approach like https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ and https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ would be worth investigating.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
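To make the unpaired-surrogate concern in the comment concrete, here is a minimal, JDK-only sketch. It is not Lucene's code: `SurrogateDemo`, `isWellFormedUtf16`, and `encodeWithReplacement` are hypothetical names that only mimic the behavior the comment attributes to `UnicodeUtil.validUTF16String` and `UnicodeUtil.UTF16toUTF8`. For comparison it also prints what the JDK's `String.getBytes(StandardCharsets.UTF_8)` produces for the same input; whether the proposed constructor uses that call is not stated in the comment, so treat it purely as a second encoder to compare against.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {

  /** Returns true if every surrogate in s belongs to a high/low pair. */
  static boolean isWellFormedUtf16(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return false; // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        return false; // low surrogate with no high surrogate before it
      }
    }
    return true;
  }

  /** Encodes s as UTF-8, emitting U+FFFD (EF BF BD) for each unpaired surrogate. */
  static byte[] encodeWithReplacement(CharSequence s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      // codePointAt returns the lone surrogate itself when it is unpaired.
      int cp = Character.codePointAt(s, i);
      i += Character.charCount(cp) - 1;
      if (cp >= 0xD800 && cp <= 0xDFFF) {
        cp = 0xFFFD; // substitute the replacement character
      }
      if (cp < 0x80) {
        out.write(cp);
      } else if (cp < 0x800) {
        out.write(0xC0 | (cp >> 6));
        out.write(0x80 | (cp & 0x3F));
      } else if (cp < 0x10000) {
        out.write(0xE0 | (cp >> 12));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
      } else {
        out.write(0xF0 | (cp >> 18));
        out.write(0x80 | ((cp >> 12) & 0x3F));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
      }
    }
    return out.toByteArray();
  }

  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String bad = "a\uD800b"; // lone high surrogate between two ASCII letters
    System.out.println(isWellFormedUtf16(bad));                     // false
    System.out.println(hex(encodeWithReplacement(bad)));            // 61 EF BF BD 62
    System.out.println(hex(bad.getBytes(StandardCharsets.UTF_8)));  // 61 3F 62
  }
}
```

The two encoders agree on well-formed input and diverge only on the lone surrogate, so the same ill-formed `String` could map to different bytes depending on which path builds them.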
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org