wardlican opened a new issue, #11614: URL: https://github.com/apache/iceberg/issues/11614
### Apache Iceberg version 1.4.3 ### Query engine Spark ### Please describe the bug 🐞 ``` CALL spark_catalog.system.rewrite_data_files( table => '${DATABASE_NAME}.${TABLE_NAME}', options => map( 'max-concurrent-file-group-rewrites', 500, 'target-file-size-bytes','536870912', 'max-file-group-size-bytes','10737418240', 'rewrite-all', 'true') ); ``` After using spark_catalog.system.rewrite_data_files to merge iceberg small files, the new parquet generated encountered an unreadable problem. The error message is as follows ``` client token: N/A diagnostics: User class threw exception: java.lang.RuntimeException: Job aborted due to stage failure: Task 208 in stage 7.0 failed 4 times, most recent failure: Lost task 208.3 in stage 7.0 (TID 272) (10.1.75.103 executor 9): org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595 at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165) at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141) at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130) at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93) at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595 at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366) at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909) at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163) ... 23 more ``` ### Willingness to contribute - [X] I can contribute a fix for this bug independently - [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community - [X] I cannot contribute a fix for this bug at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org