tang-hi commented on issue #12458:
URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544

    I have discovered the bug. 
   ```Java
   if (endUTF8.numBits(upto) == 5) {
     // special case -- avoid created unused edges (endUTF8
     // doesn't accept certain byte sequences) -- there
     // are other cases we could optimize too:
     startCode = 194;
   } else {
     startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);
   }
   ```
   The snippet above assumes that a byte carrying 6 bits (a continuation byte) 
always starts at 0x80. That assumption is incorrect. In a 3-byte sequence whose 
first byte is 0xE0, the second byte starts at 0xA0. Similarly, in a 4-byte 
sequence whose first byte is 0xF0, the second byte starts at 0x90. You can 
verify this against the table at: 
https://www.utf8-chartable.de/unicode-utf8-table.pl
   The PR that fixes this is #12472.
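
   The byte ranges mentioned above can be checked without the table. The 
following is a minimal, standalone sketch and not Lucene code: the class 
`Utf8SecondByte` and the method `minSecondByte` are hypothetical names made up 
for this illustration. It uses only the JDK's UTF-8 encoder and searches code 
points in ascending order for the first one whose encoding begins with a given 
lead byte.

   ```java
   import java.nio.charset.StandardCharsets;

   public class Utf8SecondByte {

     /**
      * Returns the second UTF-8 byte (as 0..255) of the smallest valid code point
      * whose encoding starts with leadByte, or -1 if no code point does.
      * Hypothetical helper for illustration; brute force, not optimized.
      */
     static int minSecondByte(int leadByte) {
       // Start at U+0080 so every encoding has at least two bytes.
       for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
         // Surrogates are not encodable as UTF-8 on their own; skip them.
         if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
           continue;
         }
         byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
         if ((utf8[0] & 0xFF) == leadByte) {
           // UTF-8 preserves code point order, so the first hit has the lowest second byte.
           return utf8[1] & 0xFF;
         }
       }
       return -1;
     }

     public static void main(String[] args) {
       for (int lead : new int[] {0xC2, 0xE0, 0xE1, 0xF0, 0xF1}) {
         System.out.printf("lead 0x%02X -> min second byte 0x%02X%n", lead, minSecondByte(lead));
       }
     }
   }
   ```

   For the lead bytes 0xE0 and 0xF0 this reports 0xA0 and 0x90 respectively, 
while 0xE1 and 0xF1 report 0x80, which is why a fixed start of 0x80 is only 
wrong for those two lead bytes.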
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org