Re: [PR] Core: Fix UnicodeUtil#truncateStringMax returns malformed string. [iceberg]

via GitHub Sun, 22 Sep 2024 21:36:09 -0700


zhongyujiang commented on code in PR #11161:
URL: https://github.com/apache/iceberg/pull/11161#discussion_r1770762980



##########
api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java:
##########
@@ -93,4 +93,24 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
     }
     return null; // Cannot find a valid upper bound
   }
+
+  private static int incrementCodePoint(int codePoint) {
+    // surrogate code points are not Unicode scalar values,
+    // any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
+    // see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
+    Preconditions.checkArgument(

Review Comment:
   It is possible for a Java string to contain an unpaired surrogate character 
(which is not a Unicode scalar value), though encoding such a string as UTF-8 
or UTF-16 will result in a `MalformedInputException`. That is what happens in 
this issue: the truncation method returns a string ending with an unpaired 
high surrogate, and encoding it to UTF-8 then fails.
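
   As a minimal sketch of that failure mode (the class and method names here 
are hypothetical and not part of the PR), the following cuts a surrogate pair 
in half and then tries to encode the result with a strict `CharsetEncoder`, 
which reports malformed input instead of replacing it:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class UnpairedSurrogateDemo {
  // Hypothetical helper: true if a strict UTF-8 encoder accepts the string.
  public static boolean encodesAsUtf8(String value) {
    try {
      // newEncoder() reports malformed input by default rather than replacing it
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
      return true;
    } catch (CharacterCodingException e) {
      // MalformedInputException is a subclass of CharacterCodingException
      return false;
    }
  }

  public static void main(String[] args) {
    // 'a' followed by U+1F600, which Java stores as the surrogate pair D83D DE00
    String original = "a\uD83D\uDE00";
    // Truncating by char count splits the pair and leaves a lone high surrogate
    String truncated = original.substring(0, 2);

    System.out.println(encodesAsUtf8(original)); // prints true
    System.out.println(encodesAsUtf8(truncated)); // prints false
  }
}
```

   Note that `String.getBytes(StandardCharsets.UTF_8)` does not throw in this 
situation; it substitutes a replacement byte, so the strict encoder is what 
makes the problem visible.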
   
   A valid UTF-8 string will not contain unpaired surrogates. However, the 
`codePointAt` [method](https://docs.oracle.com/javase/8/docs/api/java/lang/StringBuilder.html#codePointAt-int-) 
may still return an unpaired surrogate code point if an incorrect index is 
passed, for example one that points at the low-surrogate half of a pair.
   
   > /**
        * Returns the character (Unicode code point) at the specified
        * index. The index refers to {@code char} values
        * (Unicode code units) and ranges from {@code 0} to
        * {@link #length()}{@code  - 1}.
        *
        * <p> If the {@code char} value specified at the given index
        * is in the high-surrogate range, the following index is less
        * than the length of this sequence, and the
        * {@code char} value at the following index is in the
        * low-surrogate range, then the supplementary code point
        * corresponding to this surrogate pair is returned. Otherwise,
        * the {@code char} value at the given index is returned.
        *
        * @param      index the index to the {@code char} values
        * @return     the code point value of the character at the
        *             {@code index}
        * @throws     IndexOutOfBoundsException  if the {@code index}
        *             argument is negative or not less than the length of this
        *             sequence.
        */
       public int codePointAt(int index) {
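
   The behavior described in that Javadoc can be illustrated with a short 
sketch. The class name and the `isSurrogateCodePoint` helper are hypothetical; 
only `codePointAt` and `offsetByCodePoints` are JDK methods:

```java
public class CodePointAtDemo {
  // Hypothetical helper: true when the value lies in U+D800..U+DFFF.
  public static boolean isSurrogateCodePoint(int codePoint) {
    return codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE;
  }

  public static void main(String[] args) {
    // 'a' followed by U+1F600, stored as the surrogate pair D83D DE00
    StringBuilder sb = new StringBuilder("a\uD83D\uDE00");

    int atPairStart = sb.codePointAt(1); // index of the high surrogate
    int insidePair = sb.codePointAt(2); // index of the low surrogate

    System.out.println(Integer.toHexString(atPairStart)); // prints 1f600
    System.out.println(Integer.toHexString(insidePair)); // prints de00
    System.out.println(isSurrogateCodePoint(insidePair)); // prints true

    // Stepping by code points instead of chars never lands inside a pair
    System.out.println(sb.offsetByCodePoints(1, 1)); // prints 3
  }
}
```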
   
   Currently, all methods in the `UnicodeUtil` class that use `codePointAt` 
pass correct indexes and will not produce an unpaired surrogate code point. I 
added this precondition check to strengthen the validation.
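
   For illustration only, a defensive check of this kind might look like the 
sketch below. `nextScalarValue` is a hypothetical helper, not the PR's 
`incrementCodePoint`; it uses plain JDK exceptions instead of `Preconditions`, 
and its handling of the last code point is an assumption:

```java
public class ScalarValueDemo {
  // Hypothetical helper: returns the next Unicode scalar value after codePoint.
  public static int nextScalarValue(int codePoint) {
    boolean isSurrogate =
        codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE;
    // Reject out-of-range values and surrogates, which are not scalar values
    if (!Character.isValidCodePoint(codePoint) || isSurrogate) {
      throw new IllegalArgumentException(
          "Not a Unicode scalar value: 0x" + Integer.toHexString(codePoint));
    }
    // Assumption for this sketch: the last code point has no successor
    if (codePoint == Character.MAX_CODE_POINT) {
      throw new IllegalArgumentException("No scalar value after U+10FFFF");
    }
    // Jump over the surrogate block so the result is always encodable
    if (codePoint == Character.MIN_SURROGATE - 1) {
      return Character.MAX_SURROGATE + 1;
    }
    return codePoint + 1;
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(nextScalarValue('a'))); // prints 62
    System.out.println(Integer.toHexString(nextScalarValue(0xD7FF))); // prints e000
  }
}
```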
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

