gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206972753
##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
int utf8Upto = utf8.offset;
final byte[] bytes = utf8.bytes;
final int utf8Limit = utf8.offset + utf8.length;
+ UTF8CodePoint reuse = null;
while (utf8Upto < utf8Limit) {
- final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
- int v = 0;
- switch (numBytes) {
- case 1:
- ints[utf32Count++] = bytes[utf8Upto++];
- continue;
- case 2:
- // 5 useful bits
- v = bytes[utf8Upto++] & 31;
- break;
- case 3:
- // 4 useful bits
- v = bytes[utf8Upto++] & 15;
- break;
- case 4:
- // 3 useful bits
- v = bytes[utf8Upto++] & 7;
- break;
- default:
- throw new IllegalArgumentException("invalid utf8");
- }
+ reuse = codePointAt(bytes, utf8Upto, reuse);
+ ints[utf32Count++] = reuse.codePoint;
+ utf8Upto += reuse.codePointBytes;
+ }
- // TODO: this may read past utf8's limit.
- final int limit = utf8Upto + numBytes - 1;
- while (utf8Upto < limit) {
- v = v << 6 | bytes[utf8Upto++] & 63;
+ return utf32Count;
+ }
+
+ /**
+ * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+ * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+ * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+ * validation.
+ *
+ * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
+ * prematurely truncated.
+ */
+ public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) {
+ if (reuse == null) {
+ reuse = new UTF8CodePoint();
+ }
+
+ int leadByte = utf8[pos] & 0xFF;
+ int numBytes = utf8CodeLength[leadByte];
+ reuse.codePointBytes = numBytes;
+ int v;
+ switch (numBytes) {
+ case 1 -> {
+ reuse.codePoint = leadByte;
+ return reuse;
}
- ints[utf32Count++] = v;
+ case 2 -> v = leadByte & 31; // 5 useful bits
+ case 3 -> v = leadByte & 15; // 4 useful bits
+ case 4 -> v = leadByte & 7; // 3 useful bits
+ default -> throw new IllegalArgumentException("invalid utf8");
Review Comment:
How about including just the header byte that resulted in the illegal parse
in the exception message? I'm a little nervous about including the whole
substring of bytes, since it has unbounded length and could get unwieldy.
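
To make the suggestion concrete, here is a rough sketch of a message that
reports only the offending header byte and its position. The class name
`Utf8LeadByteMessage`, the helper `codeLength`, and the message wording are
all hypothetical; `codeLength` is a simplified stand-in for the
`utf8CodeLength` table and is not the table from `UnicodeUtil`.

```java
public final class Utf8LeadByteMessage {

  // Hypothetical stand-in for the utf8CodeLength lookup: maps a lead byte
  // (already masked to 0..255) to its sequence length, or -1 if it cannot
  // start a 1-4 byte sequence.
  static int codeLength(int leadByte) {
    if (leadByte < 0x80) return 1; // 0xxxxxxx
    if (leadByte >= 0xC0 && leadByte < 0xE0) return 2; // 110xxxxx
    if (leadByte >= 0xE0 && leadByte < 0xF0) return 3; // 1110xxxx
    if (leadByte >= 0xF0 && leadByte < 0xF8) return 4; // 11110xxx
    return -1; // continuation byte (10xxxxxx) or otherwise invalid
  }

  // Returns the payload bits carried by the lead byte at pos. On an invalid
  // lead byte, the exception names only that byte and its position, so the
  // message stays bounded regardless of input length.
  static int leadBits(byte[] utf8, int pos) {
    int leadByte = utf8[pos] & 0xFF;
    switch (codeLength(leadByte)) {
      case 1:
        return leadByte;
      case 2:
        return leadByte & 31; // 5 useful bits
      case 3:
        return leadByte & 15; // 4 useful bits
      case 4:
        return leadByte & 7; // 3 useful bits
      default:
        throw new IllegalArgumentException(
            "invalid utf8: unexpected lead byte 0x"
                + Integer.toHexString(leadByte)
                + " at position "
                + pos);
    }
  }
}
```

With something along these lines, a stray continuation byte such as 0x80 at
position 0 would be reported by value and position only, without dumping the
surrounding bytes.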