CodingJun opened a new issue, #10312: URL: https://github.com/apache/iceberg/issues/10312
### Apache Iceberg version

1.5.1

### Query engine

Spark

### Please describe the bug 🐞

I have a program that continuously writes streaming data to Iceberg and regularly uses Spark to compact the data files. After compacting the data files, I found that some of the data was not deleted correctly: a row removed by an equality delete was visible again. The following example reproduces the problem.

Original table:

|id|value|
|---|---|
|1|a|
|2|b|
|3|c|

Writing process:

- t1: Thread 1 starts compacting data files with RewriteDataFilesSparkAction. (start snapshot-id: 1, start sequence-number: 1)
- t2: Thread 2 writes an equality delete, id = 2. (snapshot-id: 2, sequence-number: 2)
- t3: Thread 2 appends new data, [4, d]. (snapshot-id: 3, sequence-number: 3)
- t4: Thread 1 completes the compaction. (snapshot-id: 4, sequence-number: 4)

Result:

|id|value|
|---|---|
|1|a|
|2|b|
|3|c|
|4|d|

The correct result should be:

|id|value|
|---|---|
|1|a|
|3|c|
|4|d|

PS: When I set `use-starting-sequence-number = false` for rewriteDataFiles, the compaction in Thread 1 fails at t4.
stacktrace:

```
Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile{content=data, file_path=/var/folders/5z/dqrlv_ts0wqf36vd39bb384h0000gn/T/junit17491575750166086656/9f77fae8-d62a-426d-971f-a342b6775c44/test_schema/test_table/data/00000-2-52ae94aa-b796-4c42-bf9c-92d36c52e522-00001.parquet, file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=1, file_size_in_bytes=407, column_sizes=null, value_counts=org.apache.iceberg.util.SerializableMap@0, null_value_counts=org.apache.iceberg.util.SerializableMap@1, nan_value_counts=org.apache.iceberg.util.SerializableMap@0, lower_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, upper_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=null}
    at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:50)
    at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:418)
    at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:367)
    at org.apache.iceberg.BaseRewriteFiles.validate(BaseRewriteFiles.java:108)
    at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:175)
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:296)
    at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
    at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190)
    at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:295)
    at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitFileGroups(RewriteDataFilesCommitManager.java:89)
    at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitOrClean(RewriteDataFilesCommitManager.java:110)
    at org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction.doExecute(RewriteDataFilesSparkAction.java:291)
    ... 8 more
```

Question: Why are the equality delete files lost? Is this expected behavior, or is it a bug?
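
To make the t1 to t4 timeline easier to reason about, here is a minimal, self-contained sketch of the rule in the Iceberg table spec that an equality delete applies only to data files whose data sequence number is strictly lower than the delete's. It does not use any Iceberg classes; `EqualityDeleteSequenceSketch`, `DataFile`, `EqualityDelete`, `scan`, and `tableAfterCompaction` are hypothetical names for illustration only. The sketch assumes the compacted file replaces the original file holding ids 1, 2, 3, and it only varies which sequence number that compacted file carries. It does not establish which value the file actually received in the run reported above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scenario in this issue; not Iceberg code.
final class EqualityDeleteSequenceSketch {

    // A data file reduced to the ids it holds and its data sequence number.
    static final class DataFile {
        final List<Integer> ids;
        final long dataSequenceNumber;

        DataFile(List<Integer> ids, long dataSequenceNumber) {
            this.ids = ids;
            this.dataSequenceNumber = dataSequenceNumber;
        }
    }

    // An equality delete on the id column, tagged with its sequence number.
    static final class EqualityDelete {
        final int id;
        final long sequenceNumber;

        EqualityDelete(int id, long sequenceNumber) {
            this.id = id;
            this.sequenceNumber = sequenceNumber;
        }
    }

    // Spec rule: the delete applies only to strictly older data files.
    static boolean applies(EqualityDelete delete, DataFile file) {
        return file.dataSequenceNumber < delete.sequenceNumber;
    }

    // Return the ids that survive after applying every applicable delete.
    static List<Integer> scan(List<DataFile> files, List<EqualityDelete> deletes) {
        List<Integer> live = new ArrayList<>();
        for (DataFile file : files) {
            for (int id : file.ids) {
                boolean deleted = false;
                for (EqualityDelete delete : deletes) {
                    if (delete.id == id && applies(delete, file)) {
                        deleted = true;
                    }
                }
                if (!deleted) {
                    live.add(id);
                }
            }
        }
        return live;
    }

    // Table state after t4, given the sequence number of the compacted file.
    static List<Integer> tableAfterCompaction(long compactedFileSequenceNumber) {
        List<DataFile> files = List.of(
            new DataFile(List.of(1, 2, 3), compactedFileSequenceNumber), // rewritten at t4
            new DataFile(List.of(4), 3));                                 // appended at t3
        List<EqualityDelete> deletes = List.of(new EqualityDelete(2, 2)); // written at t2
        return scan(files, deletes);
    }

    public static void main(String[] args) {
        // Compacted file keeps the starting sequence number 1: id 2 stays deleted.
        System.out.println(tableAfterCompaction(1)); // [1, 3, 4]
        // Compacted file carries the new sequence number 4: id 2 reappears.
        System.out.println(tableAfterCompaction(4)); // [1, 2, 3, 4]
    }
}
```

Under this rule, a compacted file that keeps sequence number 1 still has the t2 delete applied, which matches the result I expected. A compacted file that carries sequence number 4 escapes the delete and reproduces the result I observed. That would also be consistent with the commit being rejected when `use-starting-sequence-number = false`, since the new delete could no longer apply to the rewritten file.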