mikemccand commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1280433424
########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 + // 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 + // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90 + // you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl final int startCode; if (endUTF8.numBits(upto) == 5) { - // special case -- avoid created unused edges (endUTF8 - // doesn't accept certain byte sequences) -- there - // are other cases we could optimize too: - startCode = 194; + startCode = 0xC2; + } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { + startCode = 0xA0; + } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { + startCode = 0x90; } else { startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]); Review Comment: I wonder why `MASKS` is size 32? We only access it with `numBits` -- shouldn't it only be length 7? Separately we should make it 1-based so access doesn't have to keep subtracting 1 ... might eek out a bit of CPU. But let's fix that later/separately... ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 + // 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 + // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90 + // you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl final int startCode; if (endUTF8.numBits(upto) == 5) { - // special case -- avoid created unused edges (endUTF8 - // doesn't accept certain byte sequences) -- there - // are other cases we could optimize too: - startCode = 194; + startCode = 0xC2; + } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { + startCode = 0xA0; + } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) { + startCode = 0x90; Review Comment: Hmm should we also set `0x80` in the `upto == 2` case and `endUTF8.len == 4` when the leading bytes are `0xF0 0x90`? But how come no test is failing, if that's right? Edit: OK, I see, the existing code will in fact set `startCode = 0x80` already in that case, so, no bug, phew. And this same logic happens to work for 2nd byte of 1 -> 2 UTF8 length transition as well (also 0x80, which the first `if` handles correctly today). Tricky! ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 + // 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 Review Comment: Extra space before `and` and `first`? ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 + // 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 + // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90 + // you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl final int startCode; if (endUTF8.numBits(upto) == 5) { - // special case -- avoid created unused edges (endUTF8 - // doesn't accept certain byte sequences) -- there - // are other cases we could optimize too: - startCode = 194; + startCode = 0xC2; + } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) { + startCode = 0xA0; Review Comment: Can you move the above three bullets into each of these internal `if`s to make the special casing clearer? E.g. in this `if` maybe say: ``` // the first length=3 UTF8 Unicode character is E0 A0 80 so we must special case A0 as the 2nd byte when E0 was the first byte of endUTF8 in this case ``` ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 Review Comment: Instead of saying `if a codepoint's numBits is 5`, could we say `if a codepoint encodes to 2 UTF-8 bytes`? And same for bullets 2 & 3 (3 UTF-8 bytes, 4 UTF-8 bytes)? ########## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ########## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int upto, boolean doA endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]), endUTF8.byteAt(upto)); } else { + // There are three special case + // 1. if codepoint's numBits is 5, byte start from 0xC2 + // 2. if codepoint's len is 3 and first byte is 0xE0. the second byte start from 0xA0 + // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte start from 0x90 + // you could found the reference in https://www.utf8-chartable.de/unicode-utf8-table.pl final int startCode; if (endUTF8.numBits(upto) == 5) { Review Comment: For consistency, could we change this to: ``` if (endUTF8.len == 2) { assert upto == 0; // the upto==1 case will be handled by the first if above startCode = 0xC2; } ``` ? ########## lucene/CHANGES.txt: ########## @@ -182,6 +182,10 @@ Bug Fixes * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458 (Greg Miller). +* GITHUB#2472: Fix the incorrect conversion from a Unicode (UTF-32) automaton to its UTF-8 representation. Review Comment: Could we reword maybe to: ``` UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any user impact except in the very unusual possible use-case of searching a non-UTF-8 (fully binary terms) inverted field using Unicode search terms. ``` ? Net/net I don't think this will impact any users ... users who index binary content would search it with binary terms (not via conversion from Unicode, where this bug lurks) unless they had a truly exotic use case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org