tang-hi commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1294796518
##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end
       utf8.addTransition(
-          start,
-          end,
-          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
-          endUTF8.byteAt(upto));
+          start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), endUTF8.byteAt(upto));
     } else {
       final int startCode;
-      if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {
+        assert upto == 0; // the upto==1 case will be handled by the first if above
+        // the first length=2 UTF8 Unicode character is C2 80,
+        // so we must special case 0xC2 as the 1st byte.
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        // the first length=3 UTF8 Unicode character is E0 A0 80,
+        // so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {

Review Comment:
   If you comment this branch out, the test `testUTF8SpanMultipleBytes` in `TestUnicodeUtil.java` fails. Without it, a transition that spans from 0xFFFF (3 bytes) to 0x10000 (4 bytes) produces an incorrect result.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
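
For context on the byte values in the review above: each special case in the diff corresponds to the smallest code point at a given UTF-8 length. The following is a minimal sketch using only the JDK, not Lucene code; the class name `Utf8Boundaries` and its `hex` helper are hypothetical and exist only for illustration. It prints the UTF-8 bytes on both sides of each length boundary, including the 0xFFFF / 0x10000 pair mentioned in the comment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Lucene.
public class Utf8Boundaries {

  /** Encodes one code point as UTF-8 and renders the bytes as hex, e.g. "C2 80". */
  static String hex(int codePoint) {
    byte[] bytes =
        new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      // Mask to 0..255 so negative Java bytes print as two hex digits.
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Last code point of each UTF-8 length followed by the first of the next.
    int[] codePoints = {0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000};
    for (int cp : codePoints) {
      System.out.println(String.format("U+%04X -> %s", cp, hex(cp)));
    }
    // Prints:
    // U+007F -> 7F
    // U+0080 -> C2 80
    // U+07FF -> DF BF
    // U+0800 -> E0 A0 80
    // U+FFFF -> EF BF BF
    // U+10000 -> F0 90 80 80
  }
}
```

The last two lines show why a range crossing 0xFFFF / 0x10000 changes encoded length from 3 to 4 bytes, with F0 90 80 80 as the first 4-byte sequence. The sketch only demonstrates the encoding boundaries; it does not model how `UTF32ToUTF8` builds its transitions.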