[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

via GitHub Fri, 26 May 2023 08:38:27 -0700


gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206972753



##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final 
int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified 
{@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other 
related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not 
perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs 
or the content is
+   *     prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint 
reuse) {
+    if (reuse == null) {
+      reuse = new UTF8CodePoint();
+    }
+
+    int leadByte = utf8[pos] & 0xFF;
+    int numBytes = utf8CodeLength[leadByte];
+    reuse.codePointBytes = numBytes;
+    int v;
+    switch (numBytes) {
+      case 1 -> {
+        reuse.codePoint = leadByte;
+        return reuse;
       }
-      ints[utf32Count++] = v;
+      case 2 -> v = leadByte & 31; // 5 useful bits
+      case 3 -> v = leadByte & 15; // 4 useful bits
+      case 4 -> v = leadByte & 7; // 3 useful bits
+      default -> throw new IllegalArgumentException("invalid utf8");

Review Comment:
   How about the header byte that resulted in an illegal parse? I'm a little 
nervous of including the whole substring of bytes as it has unbounded length 
and could be a bit unwieldy?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Reply via email to