[GitHub] [lucene] mikemccand commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

via GitHub Tue, 01 Aug 2023 05:52:38 -0700


mikemccand commented on code in PR #12472:
URL: https://github.com/apache/lucene/pull/12472#discussion_r1280433424



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte 
start from 0xA0
+      // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte 
start from 0x90
+      // you could found the reference in 
https://www.utf8-chartable.de/unicode-utf8-table.pl
       final int startCode;
       if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
+        startCode = 0x90;
       } else {
         startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]);

Review Comment:
   I wonder why `MASKS` is size 32?  We only access it with `numBits` -- 
shouldn't it only be length 7?  Separately we should make it 1-based so access 
doesn't have to keep subtracting 1 ... might eek out a bit of CPU.  But let's 
fix that later/separately...



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte 
start from 0xA0
+      // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte 
start from 0x90
+      // you could found the reference in 
https://www.utf8-chartable.de/unicode-utf8-table.pl
       final int startCode;
       if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;
+      } else if (endUTF8.len == 4 && upto == 1 && endUTF8.byteAt(0) == 0xF0) {
+        startCode = 0x90;

Review Comment:
   Hmm should we also set `0x80` in the `upto == 2` case and `endUTF8.len == 4` 
when the leading bytes are `0xF0 0x90`?  But how come no test is failing, if 
that's right?
   
   Edit: OK, I see, the existing code will in fact set `startCode = 0x80` 
already in that case, so, no bug, phew.  And this same logic happens to work 
for 2nd byte of 1 -> 2 UTF8 length transition as well (also 0x80, which the 
first `if` handles correctly today).  Tricky!



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte 
start from 0xA0

Review Comment:
   Extra space before `and` and `first`?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte 
start from 0xA0
+      // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte 
start from 0x90
+      // you could found the reference in 
https://www.utf8-chartable.de/unicode-utf8-table.pl
       final int startCode;
       if (endUTF8.numBits(upto) == 5) {
-        // special case -- avoid created unused edges (endUTF8
-        // doesn't accept certain byte sequences) -- there
-        // are other cases we could optimize too:
-        startCode = 194;
+        startCode = 0xC2;
+      } else if (endUTF8.len == 3 && upto == 1 && endUTF8.byteAt(0) == 0xE0) {
+        startCode = 0xA0;

Review Comment:
   Can you move the above three bullets into each of these internal `if`s to 
make the special casing clearer?
   
   E.g. in this `if` maybe say:
   
   ```
   // the first length=3 UTF8 Unicode character is E0 A0 80 so we must special 
case A0 as the 2nd byte when E0 was the first byte of endUTF8 in this case
   ```



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2

Review Comment:
   Instead of saying `if a codepoint's numBits is 5`, could we say `if a 
codepoint encodes to 2 UTF-8 bytes`?  And same for bullets 2 & 3 (3 UTF-8 
bytes, 4 UTF-8 bytes)?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java:
##########
@@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence 
endUTF8, int upto, boolean doA
           endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]),
           endUTF8.byteAt(upto));
     } else {
+      // There are three special case
+      // 1. if codepoint's numBits is 5, byte start from 0xC2
+      // 2. if codepoint's len is 3  and  first byte is 0xE0. the second byte 
start from 0xA0
+      // 3. if codepoint's len is 4 and first byte is 0xF0, the second byte 
start from 0x90
+      // you could found the reference in 
https://www.utf8-chartable.de/unicode-utf8-table.pl
       final int startCode;
       if (endUTF8.numBits(upto) == 5) {

Review Comment:
   For consistency, could we change this to:
   
   ```
       if (endUTF8.len == 2) {
           assert upto == 0;   // the upto==1 case will be handled by the first 
if above
           startCode = 0xC2;
       }
   ```
   
   ?



##########
lucene/CHANGES.txt:
##########
@@ -182,6 +182,10 @@ Bug Fixes
 * GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton 
conversion bug discovered in GH#12458
   (Greg Miller).
 
+* GITHUB#2472: Fix the incorrect conversion from a Unicode (UTF-32) automaton 
to its UTF-8 representation.

Review Comment:
   Could we reword maybe to:
   
   ```
   UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences.  
This should not have any user impact except in the very unusual possible 
use-case of searching a non-UTF-8 (fully binary terms) inverted field using 
Unicode search terms.
   ```
   
   ?
   
   Net/net I don't think this will impact any users ... users who index binary 
content would search it with binary terms (not via conversion from Unicode, 
where this bug lurks) unless they had a truly exotic use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

Reply via email to