supsupsap opened a new issue, #9674: URL: https://github.com/apache/iceberg/issues/9674
### Apache Iceberg version

1.4.3 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

Hi all! I have a problem with unpredictable behavior of the Iceberg rewrite data files procedure.

**ENV:**

> Spark 3.5.0 on k8s
> Spark config - `{"driver_cores": "6", "driver_memory": "40Gi", "executor_cores": "2", "executor_memory": "50Gi", "num_executors": "30"}`
> Iceberg 1.4.2
> S3

Rewrite data files procedure config (a sketch of the full invocation is included after the run summaries below):

```
strategy="binpack",
options={
    "rewrite-job-order": "bytes-asc",
    "max-file-group-size-bytes": f"{1024*1024*1024}",
    "target-file-size-bytes": f"{512*1024*1024}",
    "min-input-files": "5",
    "max-concurrent-file-group-rewrites": "30",
    "partial-progress.enabled": "true",
    "partial-progress.max-commits": "20",
}
```

Running the procedure three times with the same parameters on the same table produced different results and different errors. I can't understand what I'm doing wrong; maybe someone can explain what is going on?

**TLDR - The first run reported no file rewrite failures, yet it did not rewrite all the files that were eligible for rewriting. Subsequent runs on the same, unchanged table rewrote additional files and produced errors related to `Premature end of Content-Length`.**

Three jobs were run with the same parameters on a partitioned table with many small files, and here are the results I got. (No new data was written to this table between the runs of this procedure.)

**1st try:**

> duration - 83 minutes
> rewritten_data_files_count = 100638
> added_data_files_count = 52
> rewritten_bytes_count = 17814636741
> failed_data_files_count = 0 (!!!)

**Errors/Warnings:**

1) There were many warnings like this:

```
2024-02-01T14:09:25,893 WARN [Rewrite-Service-15] org.apache.iceberg.util.Tasks - Retrying task after failure: Cannot commit: stale table metadata
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit: stale table metadata
```

2)

```
2024-02-01T14:13:39,857 WARN [Rewrite-Service-4] org.apache.hadoop.hive.metastore.RetryingMetaStoreClient - MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getTable
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset by peer (Write failed)
...
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
```

3)

```
2024-02-01T14:13:39,958 WARN [Rewrite-Service-6] org.apache.iceberg.util.Tasks - Retrying task after failure: Base metadata location 's3a://obs-zdp-warehouse-stage-mz/stage.db/test_partitioned/metadata/15461-ff3d5cc3-b0bb-4847-a433-79526c16e22a.metadata.json' is not same as the current table metadata location 's3a://obs-zdp-warehouse-stage-mz/stage.db/test_partitioned/metadata/15462-262e2ad3-9bd9-409f-bbc1-b1fefa891477.metadata.json' for stage.test_partitioned
```

4)

```
2024-02-01T14:13:49,268 ERROR [Rewrite-Service-24] org.apache.iceberg.actions.RewriteDataFilesCommitManager - Cannot commit groups [RewriteFileGroup{info=FileGroupInfo{globalIndex=8, partitionIndex=1, partition=PartitionData{date=19753, date2=19752, event_type=stupid_event}}, numRewrittenFiles=6062, numAddedFiles=3, numRewrittenBytes=1073690623}, RewriteFileGroup{info=FileGroupInfo{globalIndex=3, partitionIndex=1, partition=PartitionData{date=19753, date2=19753, event_type=neutral_event}}, numRewrittenFiles=6049, numAddedFiles=3, numRewrittenBytes=1073712029}, RewriteFileGroup{info=FileGroupInfo{globalIndex=5, partitionIndex=1, partition=PartitionData{date=19753, date2=19752, event_type=neutral_event}}, numRewrittenFiles=6062, numAddedFiles=3, numRewrittenBytes=1073741011}], attempting to clean up written files
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit: stale table metadata
```

5)

```
2024-02-01T14:19:12,950 WARN [ForkJoinPool.commonPool-worker-1] org.apache.thrift.transport.TIOStreamTransport - Error closing output stream.
java.net.SocketException: Socket closed
```

**Summarizing the first launch:** the first run took a long time and rewrote a large number of files, and notably it reported no failed files. There were several warning and error messages related to stale table metadata, connection issues, and socket closures, but judging from the output there were no file rewrite failures in the end. The new snapshot contains around 180k files.

**We looked inside S3 and saw that some partitions still contained a lot of small files, so we ran the procedure a second time** (a sketch of how this check can be done from the `files` metadata table instead is included at the end of this report).

**2nd try:**

> duration - 25 minutes
> rewritten_data_files_count = 75415 - (what? why? Surely nothing should have been rewritten here, given the 1st try and its failed_data_files_count of 0?)
> added_data_files_count = 58
> rewritten_bytes_count = 13346723044
> failed_data_files_count = 9463 (now we have failed files; I remind you that nothing was added to the table between the runs)

**Errors/Warnings:**

There were many errors like this:

```
2024-02-01T14:47:34,693 INFO [Rewrite-Service-26] org.apache.spark.scheduler.DAGScheduler - Job 18 failed: save at SparkBinPackDataRewriter.java:62, took 316.504764 s
2024-02-01T14:47:34,694 ERROR [Rewrite-Service-26] org.apache.spark.sql.execution.datasources.v2.AppendDataExec - Data source write support IcebergBatchWrite(table=iceberg_hive.stage.test_partitioned, format=PARQUET) is aborting.
2024-02-01T14:47:34,700 INFO [Rewrite-Service-26] org.apache.iceberg.spark.source.SparkCleanupUtil - Deleted 0 file(s) using bulk deletes (job abort)
2024-02-01T14:47:34,700 ERROR [Rewrite-Service-26] org.apache.spark.sql.execution.datasources.v2.AppendDataExec - Data source write support IcebergBatchWrite(table=iceberg_hive.stage.test_partitioned, format=PARQUET) aborted.
2024-02-01T14:47:34,701 ERROR [Rewrite-Service-26] org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction - Failure during rewrite group FileGroupInfo{globalIndex=21, partitionIndex=1, partition=PartitionData{date=19753, date2=19752, event_type=neutral_event}}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 67) (10.16.37.136 executor 5): org.apache.iceberg.exceptions.RuntimeIOException: org.apache.iceberg.aws.shaded.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 233,561,838; received: 143,761,408)
...
Caused by: org.apache.iceberg.aws.shaded.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 233,561,838; received: 143,761,408)
```

**Summarizing the 2nd launch:** a lot of files were rewritten again, and we ended up with far fewer files than after the first launch. At the same time, we received many errors about rewriting files, all related to `Premature end of Content-Length`.

**3rd try:**

> duration - 13 minutes
> rewritten_data_files_count = 2615 - (again some files were rewritten, far fewer this time, but at least now it is clear where they come from: they are part of the files that failed in the second attempt)
> added_data_files_count = 2
> rewritten_bytes_count = 463745143
> failed_data_files_count = 6885

**Errors:**

There were many errors like this:

```
[February 1, 2024 8:14 PM] {spark_submit_k8s_operator.py:67} INFO - 2024-02-01T15:14:47,097 WARN [task-result-getter-0] org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 5.0 (TID 13) (10.16.33.130 executor 25): org.apache.iceberg.exceptions.RuntimeIOException: org.apache.iceberg.aws.shaded.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 296,640,009; received: 144,359,424)
```

**Summarizing the 3rd launch:** some files could be rewritten, some could not, and again the failures were related to `Premature end of Content-Length`.
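For completeness, here is a minimal PySpark sketch of how the procedure is presumably invoked with the options listed above. The use of `spark.sql()` around the `CALL` is an assumption; the catalog and table names (`iceberg_hive`, `stage.test_partitioned`) are taken from the log output, and the procedure result columns match the counters quoted in the run summaries.

```python
# Hedged sketch of the rewrite invocation, assuming it goes through the
# rewrite_data_files Spark procedure via spark.sql(); an existing `spark`
# session is assumed.
options = {
    "rewrite-job-order": "bytes-asc",
    "max-file-group-size-bytes": f"{1024 * 1024 * 1024}",  # 1 GiB file groups
    "target-file-size-bytes": f"{512 * 1024 * 1024}",      # 512 MiB target files
    "min-input-files": "5",
    "max-concurrent-file-group-rewrites": "30",
    "partial-progress.enabled": "true",
    "partial-progress.max-commits": "20",
}

# Render the options dict as a Spark SQL map literal: map('k1', 'v1', 'k2', 'v2', ...)
options_map = ", ".join(f"'{k}', '{v}'" for k, v in options.items())

result = spark.sql(
    f"""
    CALL iceberg_hive.system.rewrite_data_files(
      table => 'stage.test_partitioned',
      strategy => 'binpack',
      options => map({options_map})
    )
    """
)

# The result row carries rewritten_data_files_count, added_data_files_count,
# rewritten_bytes_count and failed_data_files_count, as quoted in the summaries above.
result.show(truncate=False)
```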
Full stack trace of the `Premature end of Content-Length` error:

```
[2024-02-07, 07:48:05 UTC] {spark_submit_k8s_operator.py:67} INFO - 2024-02-07T07:48:01,330 INFO [dispatcher-CoarseGrainedScheduler] org.apache.spark.scheduler.TaskSetManager - Starting task 1.0 in stage 6.0 (TID 10) (10.16.35.80, executor 1, partition 1, PROCESS_LOCAL, 14586 bytes)
2024-02-07T07:48:01,339 WARN [task-result-getter-2] org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 5.0 (TID 0) (10.16.35.80 executor 1): org.apache.iceberg.exceptions.RuntimeIOException: org.apache.iceberg.aws.shaded.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 297,221,738; received: 144,236,544)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:149)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:126)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:135)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:441)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1397)
    at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:486)
    at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:425)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:491)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:388)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.iceberg.aws.shaded.org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 297,221,738; received: 144,236,544)
    at org.apache.iceberg.aws.shaded.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
    at org.apache.iceberg.aws.shaded.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:133)
    at software.amazon.awssdk.core.io.SdkFilterInputStream.read(SdkFilterInputStream.java:66)
    at org.apache.iceberg.aws.s3.S3InputStream.read(S3InputStream.java:109)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:102)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
    at org.apache.iceberg.shaded.org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1850)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:990)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
    at org.apache.iceberg.parquet.ParquetReader$FileIterator.advance(ParquetReader.java:147)
    ... 32 more
```
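For reference, the "a lot of small files remain in some partitions" check mentioned above can be done against Iceberg's `files` metadata table instead of listing S3 directly. A minimal PySpark sketch, assuming the same catalog and table names and the 512 MiB target file size from the rewrite options:

```python
# Hedged sketch: count remaining data files below the 512 MiB target per partition
# using the `files` metadata table (`partition` and `file_size_in_bytes` are standard
# columns of that metadata table); an existing `spark` session is assumed.
from pyspark.sql import functions as F

target_bytes = 512 * 1024 * 1024

small_files_per_partition = (
    spark.table("iceberg_hive.stage.test_partitioned.files")
    .where(F.col("file_size_in_bytes") < target_bytes)
    .groupBy("partition")
    .agg(
        F.count("*").alias("small_file_count"),
        F.sum("file_size_in_bytes").alias("small_file_bytes"),
    )
    .orderBy(F.col("small_file_count").desc())
)

small_files_per_partition.show(truncate=False)
```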