tang-hi commented on issue #12458:
URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544
I have discovered the bug.
```Java
if (endUTF8.numBits(upto) == 5) {
  // special case -- avoid created unused edges (endUTF8
  // doesn't accept certain byte sequences) -- there
  // are other cases we could optimize too:
  startCode = 194;
} else {
  startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);
}
```
The snippet above assumes that a continuation byte (one carrying 6 bits)
always starts at 0x80. That assumption is incorrect. When the UTF-8 encoding
is 3 bytes long and the first byte is 0xE0, the second byte starts at 0xA0.
Similarly, when the encoding is 4 bytes long and the first byte is 0xF0, the
second byte starts at 0x90. You can verify this in the following table:
https://www.utf8-chartable.de/unicode-utf8-table.pl
Here is the PR that fixes it: #12472
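
As a quick sanity check of the byte ranges described in this comment, the
following sketch uses only the JDK's own UTF-8 encoder (it does not touch
Lucene's UTF32ToUTF8 code) to print the encoding of the smallest code point in
each multi-byte length class. The class name `Utf8LowerBounds` and the helper
`utf8Hex` are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names here are not part of Lucene.
class Utf8LowerBounds {

  /** Encodes one code point as UTF-8 and returns the bytes as upper-case hex. */
  static String utf8Hex(int codePoint) {
    byte[] bytes =
        new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Smallest code point that needs 2, 3 and 4 bytes respectively.
    System.out.println(utf8Hex(0x80));    // C2 80
    System.out.println(utf8Hex(0x800));   // E0 A0 80
    System.out.println(utf8Hex(0x10000)); // F0 90 80 80
  }
}
```

The second byte is 0xA0 after 0xE0 and 0x90 after 0xF0, not 0x80, because
anything lower would be an overlong encoding. The first line also suggests
where the special-case constant 194 (0xC2) in the snippet comes from: it is
the smallest lead byte of a valid 2-byte sequence.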