zhongyujiang commented on code in PR #11161:
URL: https://github.com/apache/iceberg/pull/11161#discussion_r1770762980
##########
api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java:
##########
@@ -93,4 +93,24 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
     }
     return null; // Cannot find a valid upper bound
   }
+
+  private static int incrementCodePoint(int codePoint) {
+    // surrogate code points are not Unicode scalar values,
+    // any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
+    // see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
+    Preconditions.checkArgument(

Review Comment:
   A Java string can contain an unpaired surrogate character (a code unit that is not a Unicode scalar value), but encoding such a string to UTF-8 or UTF-16 with an encoder that reports malformed input fails with `MalformedInputException`. That is what happens in this issue: the truncation method returns a string ending with an unpaired high surrogate, and encoding that string to UTF-8 then fails.

   A string decoded from valid UTF-8 never contains unpaired surrogates. However, the `codePointAt` [method](https://docs.oracle.com/javase/8/docs/api/java/lang/StringBuilder.html#codePointAt-int-) can still return an unpaired surrogate code point if it is passed an incorrect index, for example the index of the low-surrogate half of a pair.

   > /**
   >  * Returns the character (Unicode code point) at the specified
   >  * index. The index refers to {@code char} values
   >  * (Unicode code units) and ranges from {@code 0} to
   >  * {@link #length()}{@code - 1}.
   >  *
   >  * <p> If the {@code char} value specified at the given index
   >  * is in the high-surrogate range, the following index is less
   >  * than the length of this sequence, and the
   >  * {@code char} value at the following index is in the
   >  * low-surrogate range, then the supplementary code point
   >  * corresponding to this surrogate pair is returned. Otherwise,
   >  * the {@code char} value at the given index is returned.
   >  *
   >  * @param index the index to the {@code char} values
   >  * @return the code point value of the character at the
   >  * {@code index}
   >  * @throws IndexOutOfBoundsException if the {@code index}
   >  * argument is negative or not less than the length of this
   >  * sequence.
   >  */
   > public int codePointAt(int index) {

   Currently, every method in the `UnicodeUtil` class that uses `codePointAt` passes a correct index, so none of them can produce an unpaired surrogate code point. I added the precondition check only to strengthen the validation.
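
   To make the two behaviors described above concrete, here is a minimal, self-contained sketch. It is not Iceberg code: the class `SurrogateDemo` and the helpers `encodesStrictly` and `checkedCodePointAt` are hypothetical names chosen for illustration, and the guard throws `IllegalArgumentException` directly instead of using `Preconditions` so that the example needs only the JDK.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not part of Iceberg.
final class SurrogateDemo {

  // U+1F600 is stored as the surrogate pair \uD83D \uDE00 in a Java String.
  static final String PAIRED = "a\uD83D\uDE00";

  // Returns true if a UTF-8 encoder that reports malformed input
  // (the default for a fresh CharsetEncoder) accepts the input.
  static boolean encodesStrictly(CharSequence input) {
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(input));
      return true;
    } catch (CharacterCodingException e) {
      // MalformedInputException is a subclass of CharacterCodingException.
      return false;
    }
  }

  // Hypothetical guard in the spirit of the check discussed above:
  // reject any code point in the surrogate range U+D800..U+DFFF.
  static int checkedCodePointAt(CharSequence input, int index) {
    int codePoint = Character.codePointAt(input, index);
    if (codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE) {
      throw new IllegalArgumentException(
          "Unpaired surrogate at index " + index + ": U+"
              + Integer.toHexString(codePoint).toUpperCase());
    }
    return codePoint;
  }

  public static void main(String[] args) {
    // Index of the high surrogate: the full supplementary code point is returned.
    System.out.println(Integer.toHexString(PAIRED.codePointAt(1))); // 1f600
    // Index of the low surrogate: the lone low surrogate is returned.
    System.out.println(Integer.toHexString(PAIRED.codePointAt(2))); // de00
    // The well-formed string encodes; the one cut after the high surrogate does not.
    System.out.println(encodesStrictly(PAIRED)); // true
    System.out.println(encodesStrictly(PAIRED.substring(0, 2))); // false
  }
}
```

   Note that `String.getBytes(StandardCharsets.UTF_8)` would not fail on the truncated string; it silently substitutes a replacement byte. The exception only appears with an encoder that reports malformed input, as in the sketch.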