mikemccand commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1278544730
##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
           // doesn't accept certain byte sequences) -- there
           // are other cases we could optimize too:
           startCode = 194;

Review Comment:
   Let's maybe fix this one to hex as well (0xC2)?

##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -238,6 +238,10 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA
           // doesn't accept certain byte sequences) -- there
           // are other cases we could optimize too:

Review Comment:
   Is this comment (`there are other cases we could optimize too`) still true :)  Or are these two new ifs covering them AND fixing this sneaky bug?

##########
lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java:
##########
@@ -188,6 +191,30 @@ public void testUTF8CodePointAt() {
     }
   }
 
+  public void testUTF8TwoToThreeBytes() throws Exception {

Review Comment:
   Maybe also add the three-to-four and one-to-two cases?
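For reference, the byte-length boundaries behind these comments can be checked with a small standalone sketch. It is not code from the PR and does not use Lucene's `UnicodeUtil`: the class name `Utf8Boundaries` and its two helpers are hypothetical, and it relies only on the JDK's own UTF-8 encoder. It shows where an encoding grows from one to two, two to three, and three to four bytes, and that the lead byte of the first two-byte code point is 0xC2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Lucene or the PR.
final class Utf8Boundaries {

  /** Number of bytes the JDK's UTF-8 encoder emits for a single code point. */
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** First (lead) byte of the code point's UTF-8 encoding, as an unsigned value. */
  static int leadByte(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
  }

  public static void main(String[] args) {
    // Last code point of one length class, first code point of the next.
    int[][] boundaries = {
      {0x7F, 0x80},      // one-to-two bytes
      {0x7FF, 0x800},    // two-to-three bytes
      {0xFFFF, 0x10000}, // three-to-four bytes
    };
    for (int[] pair : boundaries) {
      System.out.printf(
          "U+%04X -> %d byte(s), U+%04X -> %d byte(s)%n",
          pair[0], utf8Length(pair[0]), pair[1], utf8Length(pair[1]));
    }
    // U+0080 is the smallest code point that needs two bytes.
    System.out.printf("lead byte of U+0080 = 0x%02X%n", leadByte(0x80));
  }
}
```

Since 0xC2 is 194 in decimal, the hex spelling suggested for `startCode` names the same value; the three boundary pairs correspond to the one-to-two, two-to-three, and three-to-four cases mentioned for the test.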