mikemccand commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1280597272
##########
lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java:
##########
@@ -188,6 +191,37 @@ public void testUTF8CodePointAt() {
     }
   }
+  public void testUTF8SpanMultipleBytes() throws Exception {
+    Automaton.Builder b = new Automaton.Builder();
+    // start state:
+    int s1 = b.createState();
+
+    // single end accept state:
+    int s2 = b.createState();
+    b.setAccept(s2, true);
+
+    // utf8 codepoint length is 1
+    b.addTransition(s1, s2, 0x7F);
+    // utf8 codepoint length is 2
+    b.addTransition(s1, s2, 0x80);
+    b.addTransition(s1, s2, 0x7FF);
+    // utf8 codepoint length is 3
+    b.addTransition(s1, s2, 0x800);
+    b.addTransition(s1, s2, 0xFFFF);
+    // utf8 codepoint length is 4
+    b.addTransition(s1, s2, 0x10000);

Review Comment:
   I'm pretty sure the Automaton builder collapses adjacent transitions like this, but for paranoia, could you add the range explicitly? E.g.:
   
   ```
   b.addTransition(s1, s2, 0x7f, 0x80);
   ```
   
   (Instead of two separate `addTransition` calls.) And the same for the other two (four) transitions?
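
For context on why those particular pairs are worth spelling out as explicit ranges: each pair of labels in the test (0x7F/0x80, 0x7FF/0x800, 0xFFFF/0x10000) sits on either side of a UTF-8 length boundary. The following is a minimal standalone sketch, not part of the PR and not using Lucene; the class name `Utf8Boundaries` and the helper `utf8Length` are hypothetical names chosen for illustration. It computes the encoded length of each label and cross-checks it against the JDK's own encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene or the PR.
final class Utf8Boundaries {

  /** Number of bytes UTF-8 uses to encode the given Unicode code point. */
  static int utf8Length(int codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
      throw new IllegalArgumentException("not a code point: " + codePoint);
    }
    if (codePoint <= 0x7F) return 1;    // single byte (ASCII range)
    if (codePoint <= 0x7FF) return 2;   // two-byte sequences
    if (codePoint <= 0xFFFF) return 3;  // three-byte sequences
    return 4;                           // supplementary planes
  }

  public static void main(String[] args) {
    // The six labels used in the test under review.
    int[] labels = {0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000};
    for (int cp : labels) {
      // Cross-check against the JDK's UTF-8 encoder.
      int jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
      System.out.printf("U+%04X: computed=%d, jdk=%d%n", cp, utf8Length(cp), jdk);
    }
  }
}
```

Under this reading, a range transition such as `0x7f, 0x80` covers two adjacent code points whose encodings differ in length, which is presumably the case the test is meant to exercise.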