mikemccand commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1278544730


##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
         // doesn't accept certain byte sequences) -- there
         // are other cases we could optimize too:
         startCode = 194;

Review Comment:
   Let's maybe fix this one to hex as well (0xC2)?
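
   For reference, a minimal sketch of why the two spellings are interchangeable: 194 and 0xC2 are the same value, and 0xC2 is the lead byte of the first two-byte UTF-8 sequence (U+0080 encodes as C2 80; lead bytes 0xC0 and 0xC1 could only appear in overlong encodings). The class and method names below are hypothetical and are not part of `UTF32ToUTF8`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
final class LeadByteDemo {
  // Smallest lead byte of a well-formed two-byte UTF-8 sequence.
  static final int MIN_TWO_BYTE_LEAD = 0xC2;

  // Returns the first UTF-8 byte of the given code point as an unsigned value.
  static int firstByte(int codePoint) {
    byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    return utf8[0] & 0xFF;
  }

  public static void main(String[] args) {
    // The decimal and hex literals denote the same int.
    System.out.println(MIN_TWO_BYTE_LEAD == 194);
    // U+0080 is the smallest code point that needs two bytes.
    System.out.println(Integer.toHexString(firstByte(0x80)));
  }
}
```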



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
         // doesn't accept certain byte sequences) -- there
         // are other cases we could optimize too:

Review Comment:
   Is this comment (`there are other cases we could optimize too`) still true? :)
   Or do these two new ifs cover those cases AND fix this sneaky bug?



##########
lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java:
##########
@@ -188,6 +191,30 @@ public void testUTF8CodePointAt() {
     }
   }
 
+  public void testUTF8TwoToThreeBytes() throws Exception {

Review Comment:
   Maybe also add the three-to-four-byte and one-to-two-byte cases?
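
   As a rough sketch of where those boundaries sit, assuming the test is about code point ranges that cross a UTF-8 length boundary: the encoded length changes between U+007F and U+0080 (one to two bytes), U+07FF and U+0800 (two to three), and U+FFFF and U+10000 (three to four). The helper below uses only the JDK and is hypothetical; it is not the code in `TestUnicodeUtil`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene's test suite.
final class Utf8LengthBoundaries {
  // Number of bytes the JDK's UTF-8 encoder emits for one (non-surrogate) code point.
  static int utf8Length(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // Each pair straddles one length boundary: {last of shorter form, first of longer form}.
    int[][] boundaries = {
      {0x7F, 0x80},      // one-to-two bytes
      {0x7FF, 0x800},    // two-to-three bytes
      {0xFFFF, 0x10000}, // three-to-four bytes
    };
    for (int[] pair : boundaries) {
      System.out.printf(
          "U+%04X -> %d byte(s), U+%04X -> %d byte(s)%n",
          pair[0], utf8Length(pair[0]), pair[1], utf8Length(pair[1]));
    }
  }
}
```

   A test for each case could build a range that starts just below one of these boundaries and ends just above it.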



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

