tang-hi commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1294796518


##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -227,19 +227,24 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
       // start.addTransition(new Transition(endUTF8.byteAt(upto) &
       // (~MASKS[endUTF8.numBits(upto)-1]), endUTF8.byteAt(upto), end));   // type=end
       utf8.addTransition(
-          start,
-          end,
-          endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
-          endUTF8.byteAt(upto));
+          start, end, endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto)]), endUTF8.byteAt(upto));
     } else {
       final int startCode;
-      if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+      if (endUTF8.len == 2) {
+        assert upto == 0; // the upto==1 case will be handled by the first if above
+        // the first length=2 UTF8 Unicode character is C2 80,
+        // so we must special case 0xC2 as the 1st byte.
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        // the first length=3 UTF8 Unicode character is E0 A0 80,
+        // so we must special case 0xA0 as the 2nd byte when E0 was the first byte of endUTF8.
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {

Review Comment:
   If you comment out this branch, the test `testUTF8SpanMultipleBytes` in
`TestUnicodeUtil.java` fails. When a transition spans the boundary from
0xFFFF (3 bytes in UTF-8) to 0x10000 (4 bytes), the conversion produces an
incorrect result.
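
   To make the boundary concrete, here is a small standalone sketch (not part of the PR) that prints the UTF-8 bytes on either side of each length change using only the JDK encoder. The class name `Utf8BoundaryDemo` and its helpers are hypothetical and exist only for illustration. It shows that the first 4-byte sequence is F0 90 80 80, which is presumably why the second byte needs a lower bound when the first byte is 0xF0, in the same way as the C2 and E0 A0 cases in the diff.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: not Lucene code, just the JDK's UTF-8 encoder.
public class Utf8BoundaryDemo {

  /** Encodes a single Unicode code point to its UTF-8 bytes. */
  static byte[] utf8(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
  }

  /** Formats bytes as space-separated upper-case hex, e.g. "E0 A0 80". */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Last code point of each encoded length, followed by the first of the next.
    System.out.println(hex(utf8(0x7F)));    // 7F
    System.out.println(hex(utf8(0x80)));    // C2 80
    System.out.println(hex(utf8(0x7FF)));   // DF BF
    System.out.println(hex(utf8(0x800)));   // E0 A0 80
    System.out.println(hex(utf8(0xFFFF)));  // EF BF BF
    System.out.println(hex(utf8(0x10000))); // F0 90 80 80
  }
}
```

   A range that ends at 0x10000 therefore ends on the smallest valid 4-byte sequence, so an edge for the second byte that started below 0x90 after F0 would accept byte sequences that are not valid UTF-8.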




