tang-hi commented on issue #12458: URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544
I have found the bug.

```Java
if (endUTF8.numBits(upto) == 5) {
  // special case -- avoid created unused edges (endUTF8
  // doesn't accept certain byte sequences) -- there
  // are other cases we could optimize too:
  startCode = 194;
} else {
  startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);
}
```

The snippet above assumes that a byte carrying 6 bits (a continuation byte) always starts at 0x80. That assumption is incorrect. When the encoding is 3 bytes long and the first byte is 0xE0, the second byte starts at 0xA0. Similarly, when the encoding is 4 bytes long and the first byte is 0xF0, the second byte starts at 0x90.

You can verify this against the table at: https://www.utf8-chartable.de/unicode-utf8-table.pl

Here is the PR that fixes it: #12472
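The lower bounds described in this comment can be checked without Lucene by encoding the smallest code point of each UTF-8 length with the JDK encoder. This is only an illustrative sketch: the class name `Utf8MinBytes` and its helpers are hypothetical and are not part of Lucene or of the linked PR.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not Lucene code: prints the UTF-8 bytes of
// boundary code points to show where the second byte actually starts.
public class Utf8MinBytes {
  // Encode a single code point to UTF-8 using the JDK's encoder.
  public static byte[] encode(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
  }

  // Render bytes as space-separated upper-case hex, e.g. "E0 A0 80".
  public static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Smallest 2-byte encoding: lead byte is 0xC2 (decimal 194).
    System.out.println(hex(encode(0x80)));    // C2 80
    // Smallest 3-byte encoding: lead 0xE0, second byte starts at 0xA0.
    System.out.println(hex(encode(0x800)));   // E0 A0 80
    // Next 3-byte lead (0xE1): second byte starts at 0x80 again.
    System.out.println(hex(encode(0x1000)));  // E1 80 80
    // Smallest 4-byte encoding: lead 0xF0, second byte starts at 0x90.
    System.out.println(hex(encode(0x10000))); // F0 90 80 80
  }
}
```

The raised lower bounds for 0xE0 and 0xF0 exist because anything smaller would be an overlong encoding of a code point that fits in fewer bytes.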