zhongyujiang opened a new pull request, #11161: URL: https://github.com/apache/iceberg/pull/11161
We encountered an exception while writing data. It occurred while collecting Parquet column metrics; the stack trace is below.

#### Exception stack

```
Suppressed: org.apache.iceberg.exceptions.RuntimeIOException: Failed to encode value as UTF-8: Ҋ�Qڞ<֔~�MECڮV?
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:110)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:83)
	at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:343)
	at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:174)
	at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:86)
	at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:166)
	at org.apache.iceberg.io.DataWriter.close(DataWriter.java:100)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
	at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
	at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
	at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
	at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
	at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
	at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.close(SparkWrite.java:1162)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$9(WriteToDataSourceV2Exec.scala:423)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	... 10 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
	at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
	at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
	... 25 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
	at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
	at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
```

#### Investigation

After some investigation, I found that when Parquet column metrics are collected, string metrics are truncated to 16 characters by default. When the max metric is truncated and the truncated value is shorter than the original max, the last character is incremented by 1 so that the truncated value still compares greater than the original max. However, this increment did not skip code points that cannot be encoded as UTF-8, which led to the exception above.

In the case we hit, a Parquet file had a column whose max metric was longer than 16 characters, and the code point of its 16th character was '\uD7FF', which is `Character.MIN_SURROGATE - 1`. Adding 1 to it produced `Character.MIN_SURROGATE`, which is not a [valid Unicode scalar value](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288). So when `Conversions.toByteBuffer` tried to encode the truncated value as UTF-8, a `MalformedInputException` was thrown.

This fix skips invalid code points when incrementing the last character, so the truncated upper bound always remains a valid UTF-8 string (see the sketch below).

#### To reproduce

```sql
CREATE TABLE my_table (data string) USING iceberg;
INSERT INTO my_table VALUES('abcdefghigklmno\uD7FFp');
```
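For context, here is a minimal standalone sketch of the failure mode and of the idea behind the fix. The class and helper names below are illustrative only, not the code changed in this PR; it assumes a strict UTF-8 encoder like the one used by `Conversions.toByteBuffer`.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative demo only; not the actual patch.
public class TruncatedBoundDemo {

  // Sketch of the fix idea: when incrementing the last character of a
  // truncated upper bound, jump over the surrogate block [U+D800, U+DFFF],
  // which contains no valid Unicode scalar values.
  static char incrementSkippingSurrogates(char ch) {
    if (ch == Character.MIN_SURROGATE - 1) {
      return (char) (Character.MAX_SURROGATE + 1); // U+E000
    }
    return (char) (ch + 1);
  }

  public static void main(String[] args) {
    // 16th character of the truncated max value in the reported case.
    char last = '\uD7FF'; // Character.MIN_SURROGATE - 1
    String truncated = "abcdefghigklmno";

    // Naive increment produces an unpaired surrogate, U+D800.
    String naiveUpperBound = truncated + (char) (last + 1);

    // A strict UTF-8 encoder (default error action is REPORT) rejects the
    // unpaired surrogate, matching the "Input length = 1" in the stack trace.
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    try {
      encoder.encode(CharBuffer.wrap(naiveUpperBound));
    } catch (CharacterCodingException e) {
      System.out.println("naive increment fails: " + e); // MalformedInputException
    }

    // Skipping the surrogate range yields a bound that still sorts above the
    // original value and encodes cleanly.
    String fixedUpperBound = truncated + incrementSkippingSurrogates(last);
    System.out.printf("fixed bound last char: U+%04X%n", (int) fixedUpperBound.charAt(15));
  }
}
```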