[GitHub] [lucene] gsmiller opened a new issue, #12458: UTF32toUTF8 can produce automata that produce invalid unicode

via GitHub Mon, 24 Jul 2023 12:41:13 -0700


gsmiller opened a new issue, #12458:
URL: https://github.com/apache/lucene/issues/12458


   ### Description
   
   When converting a unicode (UTF32) automaton down to a UTF8 representation, 
UTF32toUTF8 can create an automaton that produces/accepts invalid UTF8. This 
happens when a transition in the input automaton contains a code-point range 
that spans different UTF8 byte counts.
   
   As an example/repro, create an automaton that accepts the two 
single-character terms: `U+65535` and `U+65536` (sample code for this in the 
repro test case below). Before compiling, the UTF32 representation is:
   
![out](https://github.com/apache/lucene/assets/16479560/de67f83e-0463-47f6-882b-470493c6492f)
   
   This range correctly covers `65535 - 65536` (`0xffff - 0x10000`). Note that 
the min value (65535) is the largest value that can be represented in three 
UTF8 bytes (it's representation is `11101111 10111111 10111111` in UTF8), and 
the max value (65536) is the smallest valid UTF8 representation that uses four 
bytes (`11110000 10010000 10000000 10000000`).
   
   After converting with `UTF32toUTF8#convert`, the resulting automaton looks 
like this:
   
![out](https://github.com/apache/lucene/assets/16479560/60035b56-a6ea-4653-97f5-68d1d97197de)
   
   This automaton incorrectly accepts/produces everything in the range `0xf0 
0x80 0x80 0x80` - `0xf0 0x8f 0xbf 0xbf`. These are four byte UTF8 sequences 
encoding all code points from `0x00 - 0xffff` (`0 - 65535`). So this automaton 
now incorrectly accepts/produces everything between `0 - 65534` (`65535` is 
valid of course). Because it's representing these values as four UTF8 bytes, I 
believe these are actually invalid UTF8 since my understanding is that code 
points should be represented with the minimal number of bytes, but I might be 
mistaken.
   
   The bug lies in the `UTF32toUTF8#end` logic.
   
   Here is a full test case that demonstrates the bug:
   ```
     public void testFoo() throws Exception {
       Automaton.Builder b = new Automaton.Builder();
       // start state:
       int s1 = b.createState();
   
       // single end accept state:
       int s2 = b.createState();
       b.setAccept(s2, true);
   
       // add two single-code-point terms: U+65535 and U+65536
       b.addTransition(s1, s2, 65535);
       b.addTransition(s1, s2, 65536);
   
       Automaton a = b.finish();
   
       CompiledAutomaton c = new CompiledAutomaton(a);
       FiniteStringsIterator it = new FiniteStringsIterator(c.automaton);
       int termCount = 0;
       for (IntsRef r = it.next(); r != null; r = it.next()) {
         termCount++;
       }
       assertEquals(2, termCount);
     }
   ```
   
   The assertion fails as:
   ```
   expected:<2> but was:<65538>
   ```
   
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller opened a new issue, #12458: UTF32toUTF8 can produce automata that produce invalid unicode

Reply via email to