mikemccand commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1280597272


##########
lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java:
##########
@@ -188,6 +191,37 @@ public void testUTF8CodePointAt() {
     }
   }
 
+  public void testUTF8SpanMultipleBytes() throws Exception {
+    Automaton.Builder b = new Automaton.Builder();
+    // start state:
+    int s1 = b.createState();
+
+    // single end accept state:
+    int s2 = b.createState();
+    b.setAccept(s2, true);
+
+    // utf8 codepoint length is 1
+    b.addTransition(s1, s2, 0x7F);
+    // utf8 codepoint length is 2
+    b.addTransition(s1, s2, 0x80);
+    b.addTransition(s1, s2, 0x7FF);
+    // utf8 codepoint length is 3
+    b.addTransition(s1, s2, 0x800);
+    b.addTransition(s1, s2, 0xFFFF);
+    // utf8 codepoint length is 4
+    b.addTransition(s1, s2, 0x10000);

Review Comment:
   I'm pretty sure the Automaton builder collapses adjacent transitions like 
this, but for paranoia, could you add the range explicitly?  E.g.:
   
   ```
   b.addTransition(s1, s2, 0x7f, 0x80);
   ```
   
   (Instead of two separate `addTransition` calls.)  And the same for the other 
two adjacent pairs (four transitions): `0x7FF`/`0x800` and `0xFFFF`/`0x10000`?
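
   For context on why these six values were picked, here is a minimal standalone sketch (not part of the PR, and not using any Lucene API) showing that each adjacent pair straddles a UTF-8 encoded-length boundary. The class name `Utf8Boundaries` and both helper methods are hypothetical names chosen for illustration; the only library calls are standard JDK ones.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; it is not part of Lucene.
public class Utf8Boundaries {

  /** Returns the UTF-8 encoded length implied by the code point's numeric range. */
  static int utf8Length(int codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
      throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
    }
    if (codePoint <= 0x7F) return 1;   // fits in 7 bits
    if (codePoint <= 0x7FF) return 2;  // fits in 11 bits
    if (codePoint <= 0xFFFF) return 3; // fits in 16 bits
    return 4;                          // everything up to U+10FFFF
  }

  /** Cross-check using the JDK's own encoder (not meaningful for lone surrogates). */
  static int jdkUtf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // The same six code points the test adds as transition labels.
    int[] boundaries = {0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000};
    for (int cp : boundaries) {
      System.out.printf(
          "U+%04X: %d byte(s) by range, %d by JDK encoder%n",
          cp, utf8Length(cp), jdkUtf8Length(cp));
    }
  }
}
```

   Each suggested range (`0x7f..0x80`, `0x7ff..0x800`, `0xffff..0x10000`) therefore covers exactly one code point on either side of a length boundary, which is what makes a single transition "span multiple bytes".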



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

