[GitHub] [lucene] tang-hi commented on issue #12458: UTF32toUTF8 can create automata that produce/accept invalid unicode

via GitHub Sat, 29 Jul 2023 06:02:28 -0700


tang-hi commented on issue #12458:
URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544


    I have discovered the bug. 
   ```Java
   if (endUTF8.numBits(upto) == 5) {
     // special case -- avoid created unused edges (endUTF8
     // doesn't accept certain byte sequences) -- there
     // are other cases we could optimize too:
     startCode = 194;
   } else {
     startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);
   }
   ```
   In the Java snippet above, the `else` branch assumes that the byte range
starts at 0x80 when the number of bits is 6 (a continuation byte). That
assumption does not always hold. In a 3-byte sequence whose first byte is 0xE0,
the second byte starts at 0xA0. Similarly, in a 4-byte sequence whose first
byte is 0xF0, the second byte starts at 0x90. You can verify this in the table
at https://www.utf8-chartable.de/unicode-utf8-table.pl
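
As a quick sanity check of those two ranges, here is a minimal sketch that uses only the JDK's own UTF-8 encoder. The class `Utf8SecondByteRange` and its methods are hypothetical names made up for this illustration; they are not part of Lucene and this is not the code from the fix.

```Java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Lucene code.
final class Utf8SecondByteRange {

  /** Lowest valid second byte for a given UTF-8 lead byte. */
  static int minSecondByte(int leadByte) {
    switch (leadByte) {
      case 0xE0:
        return 0xA0; // E0 80..9F would be an overlong 3-byte form
      case 0xF0:
        return 0x90; // F0 80..8F would be an overlong 4-byte form
      default:
        return 0x80; // ordinary continuation-byte lower bound
    }
  }

  /** Second byte the JDK emits for a code point >= U+0080. */
  static int secondByteOf(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    return utf8[1] & 0xFF;
  }

  public static void main(String[] args) {
    // Smallest code point of each multi-byte length.
    int[] firstOfLength = {0x80, 0x800, 0x10000};
    for (int cp : firstOfLength) {
      String s = new String(Character.toChars(cp));
      byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
      int lead = utf8[0] & 0xFF;
      System.out.printf("U+%04X lead=0x%02X second=0x%02X expectedMin=0x%02X%n",
          cp, lead, utf8[1] & 0xFF, minSecondByte(lead));
    }
  }
}
```

U+0800 is the smallest code point that needs three bytes and U+10000 is the smallest that needs four, so their second bytes mark the lower bounds for the 0xE0 and 0xF0 lead bytes.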
   The PR that fixes this is #12472.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

