gsmiller commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1295211741
##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end)); // type=end
       utf8.addTransition(
-          start,
-          end,
-          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
-          endUTF8.byteAt(upto));
+          start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), endUTF8.byteAt(upto));
     } else {
       final int startCode;
-      if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {
+        assert upto == 0; // the upto==1 case will be handled by the first if above
+        // the first length=2 UTF8 Unicode character is C2 80,
+        // so we must special case 0xC2 as the 1st byte.
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        // the first length=3 UTF8 Unicode character is E0 A0 80,
+        // so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {

Review Comment:
   Right, of course. That makes sense. My brain was a bit tired at the end of the day yesterday when I was looking through this, and I had an "off by one" bug.
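
   For reference, the boundary values that the new code comments in the hunk rely on can be checked with a small standalone sketch. It is not part of the PR: the class name `Utf8Boundaries` and the helper `utf8Hex` are made up for illustration, and it uses only the JDK's own UTF-8 encoder. It prints the smallest 2-, 3- and 4-byte UTF-8 sequences. The 4-byte branch is cut off in the hunk above, so the last line only shows the general UTF-8 fact and not what that branch assigns.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene or of this PR.
class Utf8Boundaries {

  // Encode a single code point as UTF-8 and render the bytes as upper-case hex.
  static String utf8Hex(int codePoint) {
    byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Smallest code point that needs 2 bytes: U+0080
    System.out.println(utf8Hex(0x80)); // C2 80
    // Smallest code point that needs 3 bytes: U+0800
    System.out.println(utf8Hex(0x800)); // E0 A0 80
    // Smallest code point that needs 4 bytes: U+10000
    System.out.println(utf8Hex(0x10000)); // F0 90 80 80
  }
}
```

   The first two results match the `C2 80` and `E0 A0 80` values quoted in the diff comments, which is why `0xC2` and `0xA0` are the lower bounds chosen for `startCode` in those branches.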