gsmiller commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1294103023
########## lucene/CHANGES.txt: ########## @@ -182,6 +182,10 @@ Bug Fixes * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458 (Greg Miller). +* GITHUB#2472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any Review Comment: Unless a user was directly using `UTF32ToUTF8#convert` right? Maybe you could mention that somehow? ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA // start.addTransition(new Transition(endUTF8.byteAt(upto) & // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end utf8.addTransition( - start, - end, - endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), - endUTF8.byteAt(upto)); + start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), endUTF8.byteAt(upto)); } else { final int startCode; - if (endUTF8.numBits(upto) == 5) { - // special case -- avoid created unused edges (endUTF8 - // doesn't accept certain byte sequences) -- there - // are other cases we could optimize too: - startCode = 194; + if (endUTF8.len == 2) { Review Comment: Maybe we should reference the issue number in a comment here that tracked the bug for future readers? It's a pretty tricky bug, so having the issue number in a comment might be useful? ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA // start.addTransition(new Transition(endUTF8.byteAt(upto) & // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end Review Comment: very minor, but could we either remove this commented-out code or also update the index used against MASKS to reflect the offset change you're making here? (I'd probably vote for removing the commented out code completely) ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA // start.addTransition(new Transition(endUTF8.byteAt(upto) & // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end utf8.addTransition( - start, - end, - endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), - endUTF8.byteAt(upto)); + start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), endUTF8.byteAt(upto)); } else { final int startCode; - if (endUTF8.numBits(upto) == 5) { - // special case -- avoid created unused edges (endUTF8 - // doesn't accept certain byte sequences) -- there - // are other cases we could optimize too: - startCode = 194; + if (endUTF8.len == 2) { + assert upto == 0; // the upto==1 case will be handled by the first if above + // the first length=2 UTF8 Unicode character is C2 80, + // so we must special case 0xC2 as the 1st byte. + startCode = 0xC2; + } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { + // the first length=3 UTF8 Unicode character is E0 A0 80, + // so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8. + startCode = 0xA0; + } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { Review Comment: I'm sort of contorting my brain to try to figure out if this can ever actually happen since the max UTF8 length is 4 bytes. I'm not opposed to keeping it, but for my own sanity, did you actually trip this problem with a test case or see something I'm overlooking? ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -35,12 +35,12 @@ public final class UTF32ToUTF8 { private static final int[] startCodes = new int[] {0, 128, 2048, 65536}; private static final int[] endCodes = new int[] {127, 2047, 65535, 1114111}; - static int[] MASKS = new int[32]; + static int[] MASKS = new int[8]; static { int v = 2; - for (int i = 0; i < 32; i++) { - MASKS[i] = v - 1; + for (int i = 0; i < 7; i++) { Review Comment: Nice tidy-up! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org