[I] Equality delete files lost after compact data files [iceberg]

via GitHub Fri, 10 May 2024 20:17:25 -0700


CodingJun opened a new issue, #10312:
URL: https://github.com/apache/iceberg/issues/10312


   ### Apache Iceberg version
   
   1.5.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have a program that continuously writes streaming data to Iceberg and 
regularly uses Spark to compact the data files. After a compaction, I found 
that some rows that had been deleted were visible again. The following example 
reproduces the problem:
   
   Original table:
   |id|value|
   |---|---|
   |1|a|
   |2|b|
   |3|c|
   
   Writing process:
   - t1: Thread 1 starts compacting data files with RewriteDataFilesSparkAction. (start snapshot-id: 1, start sequence-number: 1)
   - t2: Thread 2 writes an equality delete for id = 2. (snapshot-id: 2, sequence-number: 2)
   - t3: Thread 2 appends new data, [4, d]. (snapshot-id: 3, sequence-number: 3)
   - t4: Thread 1 completes the compaction. (snapshot-id: 4, sequence-number: 4)
   
   Result:
   |id|value|
   |---|---|
   |1|a|
   |2|b|
   |3|c|
   |4|d|
   
   The correct result should be:
   |id|value|
   |---|---|
   |1|a|
   |3|c|
   |4|d|
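
For reference, the Iceberg table spec says an equality delete applies to a data file only when the data file's data sequence number is strictly less than the delete's sequence number. Below is a minimal sketch of that rule applied to the timeline above. It is a toy model, not Iceberg code: `SequenceNumberSketch`, `DataFile`, `EqualityDelete`, `applies`, and `visibleIds` are hypothetical names, and the sequence number given to the compacted file is passed in as an assumption (1 if the rewrite keeps the starting sequence number, 4 if it takes the sequence number of the new snapshot). It only shows which outcome the rule predicts in each case and does not explain what 1.5.1 actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the spec rule; none of these types are Iceberg classes.
class SequenceNumberSketch {

    // A data file reduced to its data sequence number and the ids it contains.
    static final class DataFile {
        final long dataSeq;
        final List<Integer> ids;

        DataFile(long dataSeq, List<Integer> ids) {
            this.dataSeq = dataSeq;
            this.ids = ids;
        }
    }

    // An equality delete on the id column, committed at a given sequence number.
    static final class EqualityDelete {
        final long dataSeq;
        final int id;

        EqualityDelete(long dataSeq, int id) {
            this.dataSeq = dataSeq;
            this.id = id;
        }
    }

    // Spec rule: the delete applies only to data files whose data sequence
    // number is strictly less than the delete's sequence number.
    static boolean applies(EqualityDelete delete, DataFile file) {
        return file.dataSeq < delete.dataSeq;
    }

    // Ids visible after t4, given the sequence number assumed for the compacted file.
    static List<Integer> visibleIds(long compactedFileSeq) {
        List<DataFile> files = List.of(
            new DataFile(compactedFileSeq, List.of(1, 2, 3)), // compaction output (t4)
            new DataFile(3L, List.of(4)));                    // append at t3
        EqualityDelete delete = new EqualityDelete(2L, 2);    // id = 2 at t2

        List<Integer> visible = new ArrayList<>();
        for (DataFile file : files) {
            for (int id : file.ids) {
                boolean deleted = applies(delete, file) && id == delete.id;
                if (!deleted) {
                    visible.add(id);
                }
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println(visibleIds(1L)); // prints [1, 3, 4]
        System.out.println(visibleIds(4L)); // prints [1, 2, 3, 4]
    }
}
```

In this model the delete survives only if the compacted file keeps sequence number 1. If the file is stamped with sequence number 4, the delete written at sequence number 2 can no longer reach it, and the row with id = 2 comes back.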
   
   PS:
   When I set `use-starting-sequence-number = false` for rewriteDataFiles, 
Thread 1's compaction fails at t4 with the following stack trace:
   ```
   Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile{content=data, file_path=/var/folders/5z/dqrlv_ts0wqf36vd39bb384h0000gn/T/junit17491575750166086656/9f77fae8-d62a-426d-971f-a342b6775c44/test_schema/test_table/data/00000-2-52ae94aa-b796-4c42-bf9c-92d36c52e522-00001.parquet, file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=1, file_size_in_bytes=407, column_sizes=null, value_counts=org.apache.iceberg.util.SerializableMap@0, null_value_counts=org.apache.iceberg.util.SerializableMap@1, nan_value_counts=org.apache.iceberg.util.SerializableMap@0, lower_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, upper_bounds=org.apache.iceberg.SerializableByteBufferMap@e1782, key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=null}
        at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:50)
        at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:418)
        at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:367)
        at org.apache.iceberg.BaseRewriteFiles.validate(BaseRewriteFiles.java:108)
        at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:175)
        at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:296)
        at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
        at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198)
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190)
        at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:295)
        at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitFileGroups(RewriteDataFilesCommitManager.java:89)
        at org.apache.iceberg.actions.RewriteDataFilesCommitManager.commitOrClean(RewriteDataFilesCommitManager.java:110)
        at org.apache.iceberg.spark.actions.RewriteDataFilesSparkAction.doExecute(RewriteDataFilesSparkAction.java:291)
        ... 8 more
   ```
   
   Question:
   Why are the equality delete files lost? Is this expected behavior or a bug?


