[GitHub] [lucene] gsmiller commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

via GitHub Mon, 14 Aug 2023 18:24:57 -0700


gsmiller commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1294103023



##########
lucene/CHANGES.txt:
##########
@@ -182,6 +182,10 @@ Bug Fixes
 * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton 
conversion bug discovered in GH#12458
   (Greg Miller).
 
+* GITHUB#2472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary 
sequences.  This should not have any

Review Comment:
   Unless a user was directly using `UTF32ToUTF8#convert` right? Maybe you 
could mention that somehow?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end));   // 
type=end
       utf8.addTransition(
-          start,
-          end,
-          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
-          endUTF8.byteAt(upto));
+          start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), 
endUTF8.byteAt(upto));
     } else {
       final int startCode;
-      if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {

Review Comment:
   Maybe we should reference the issue number in a comment here that tracked 
the bug for future readers? It's a pretty tricky bug, so having the issue 
number in a comment might be useful?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end));   // 
type=end

Review Comment:
   very minor, but could we either remove this commented-out code or also 
update the index used against MASKS to reflect the offset change you're making 
here? (I'd probably vote for removing the commented out code completely)



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end));   // 
type=end
       utf8.addTransition(
-          start,
-          end,
-          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
-          endUTF8.byteAt(upto));
+          start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), 
endUTF8.byteAt(upto));
     } else {
       final int startCode;
-      if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {
+        assert upto == 0; // the upto==1 case will be handled by the first if 
above
+        // the first length=2 UTF8 Unicode character is C2 80,
+        // so we must special case 0xC2 as the 1st byte.
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        // the first length=3 UTF8 Unicode character is E0 A0 80,
+        // so we must special case 0xA0 as the 2nd byte when E0 was the first 
byte of endUTF8.
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {

Review Comment:
   I'm sort of contorting my brain to try to figure out if this can ever 
actually happen since the max UTF8 length is 4 bytes. I'm not opposed to 
keeping it, but for my own sanity, did you actually trip this problem with a 
test case or see something I'm overlooking?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -35,12 +35,12 @@ public final class UTF32ToUTF8 {
   private static final int[] startCodes = new int[] {0, 128, 2048, 65536};
   private static final int[] endCodes = new int[] {127, 2047, 65535, 1114111};
 
-  static int[] MASKS = new int[32];
+  static int[] MASKS = new int[8];
 
   static {
     int v = 2;
-    for (int i = 0; i < 32; i++) {
-      MASKS[i] = v - 1;
+    for (int i = 0; i < 7; i++) {

Review Comment:
   Nice tidy-up!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

Reply via email to