zhongyujiang opened a new pull request, #11161: URL: https://github.com/apache/iceberg/pull/11161
We encountered an exception while writing data. It occurred while collecting Parquet column metrics; the stack trace is below.

#### Exception stack

```
Suppressed: org.apache.iceberg.exceptions.RuntimeIOException: Failed to encode value as UTF-8: Ҋ�Qڞ<֔~�MECڮV?
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:110)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:83)
	at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:343)
	at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:174)
	at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:86)
	at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:166)
	at org.apache.iceberg.io.DataWriter.close(DataWriter.java:100)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
	at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
	at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
	at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
	at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
	at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
	at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.close(SparkWrite.java:1162)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$9(WriteToDataSourceV2Exec.scala:423)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	... 10 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
	at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
	at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
	... 25 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
	at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
	at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
	at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
```

#### Investigation

After some investigation, I found that when Parquet column metrics are collected, string metrics are truncated to 16 characters by default. When the max metric is truncated and the truncated value is shorter than the original max, the last character is incremented by 1 so that the truncated value still compares greater than the original max. However, this increment did not skip code points that cannot be encoded as UTF-8, which led to the exception above.

In the case we hit, a Parquet file had a column whose max metric was longer than 16 characters, and the code point of its 16th character was '\uD7FF', which is `Character.MIN_SURROGATE - 1`. Adding 1 to it produced `Character.MIN_SURROGATE`, which is not a [valid Unicode scalar value](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288). So when `Conversions.toByteBuffer` tried to encode the truncated value as UTF-8, a `MalformedInputException` was thrown.

This fix skips invalid code points when incrementing the last character, so the truncated upper bound always remains a valid UTF-8 string (see the sketch below).

#### To reproduce

```sql
CREATE TABLE my_table (data string) USING iceberg;
INSERT INTO my_table VALUES('abcdefghigklmno\uD7FFp');
```
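For context, here is a minimal standalone sketch of the failure mode and of the idea behind the fix. The class and helper names below are illustrative only, not the code changed in this PR; it assumes a strict UTF-8 encoder like the one used by `Conversions.toByteBuffer`.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative demo only; not the actual patch.
public class TruncatedBoundDemo {

  // Sketch of the fix idea: when incrementing the last character of a
  // truncated upper bound, jump over the surrogate block [U+D800, U+DFFF],
  // which contains no valid Unicode scalar values.
  static char incrementSkippingSurrogates(char ch) {
    if (ch == Character.MIN_SURROGATE - 1) {
      return (char) (Character.MAX_SURROGATE + 1); // U+E000
    }
    return (char) (ch + 1);
  }

  public static void main(String[] args) {
    // 16th character of the truncated max value in the reported case.
    char last = '\uD7FF'; // Character.MIN_SURROGATE - 1
    String truncated = "abcdefghigklmno";

    // Naive increment produces an unpaired surrogate, U+D800.
    String naiveUpperBound = truncated + (char) (last + 1);

    // A strict UTF-8 encoder (default error action is REPORT) rejects the
    // unpaired surrogate, matching the "Input length = 1" in the stack trace.
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    try {
      encoder.encode(CharBuffer.wrap(naiveUpperBound));
    } catch (CharacterCodingException e) {
      System.out.println("naive increment fails: " + e); // MalformedInputException
    }

    // Skipping the surrogate range yields a bound that still sorts above the
    // original value and encodes cleanly.
    String fixedUpperBound = truncated + incrementSkippingSurrogates(last);
    System.out.printf("fixed bound last char: U+%04X%n", (int) fixedUpperBound.charAt(15));
  }
}
```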