gsmiller opened a new issue, #12458: URL: https://github.com/apache/lucene/issues/12458
### Description When converting a unicode (UTF32) automaton down to a UTF8 representation, UTF32toUTF8 can create an automaton that produces/accepts invalid UTF8. This happens when a transition in the input automaton contains a code-point range that spans different UTF8 byte counts. As an example/repro, create an automaton that accepts the two single-character terms: `U+65535` and `U+65536` (sample code for this in the repro test case below). Before compiling, the UTF32 representation is:  This range correctly covers `65535 - 65536` (`0xffff - 0x10000`). Note that the min value (65535) is the largest value that can be represented in three UTF8 bytes (it's representation is `11101111 10111111 10111111` in UTF8), and the max value (65536) is the smallest valid UTF8 representation that uses four bytes (`11110000 10010000 10000000 10000000`). After converting with `UTF32toUTF8#convert`, the resulting automaton looks like this:  This automaton incorrectly accepts/produces everything in the range `0xf0 0x80 0x80 0x80` - `0xf0 0x8f 0xbf 0xbf`. These are four byte UTF8 sequences encoding all code points from `0x00 - 0xffff` (`0 - 65535`). So this automaton now incorrectly accepts/produces everything between `0 - 65534` (`65535` is valid of course). Because it's representing these values as four UTF8 bytes, I believe these are actually invalid UTF8 since my understanding is that code points should be represented with the minimal number of bytes, but I might be mistaken. The bug lies in the `UTF32toUTF8#end` logic. Here is a full test case that demonstrates the bug: ``` public void testFoo() throws Exception { Automaton.Builder b = new Automaton.Builder(); // start state: int s1 = b.createState(); // single end accept state: int s2 = b.createState(); b.setAccept(s2, true); // add two single-code-point terms: U+65535 and U+65536 b.addTransition(s1, s2, 65535); b.addTransition(s1, s2, 65536); Automaton a = b.finish(); CompiledAutomaton c = new CompiledAutomaton(a); FiniteStringsIterator it = new FiniteStringsIterator(c.automaton); int termCount = 0; for (IntsRef r = it.next(); r != null; r = it.next()) { termCount++; } assertEquals(2, termCount); } ``` The assertion fails as: ``` expected:<2> but was:<65538> ``` ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org