singhpk234 commented on code in PR #12270:
URL: https://github.com/apache/iceberg/pull/12270#discussion_r1961054056


##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java:
##########
@@ -156,7 +162,12 @@ private List<DeleteFile> findDanglingDeletes() {
             .or(
                 col("data_file.content")
                     .equalTo("2")
-                    .and(col("sequence_number").$less$eq(col("min_data_sequence_number"))));
+                    .and(col("sequence_number").$less$eq(col("min_data_sequence_number"))))
+            // dvs pointing to non-existing data files
+            .or(
+                col("data_file.file_format")
+                    .equalTo(FileFormat.PUFFIN.name())

Review Comment:
   Apologies for the confusion; this comment was meant for the line below, 
where we match the data file path against the file path the Puffin file 
points to.
   
   Can an exact equality check lead to a miss? For example, suppose the table 
has file_path 's3://<tbl_location>/filea.parquet' but the Puffin files point 
to 's3a://<tbl_location>/filea.parquet'. Since we do an exact not-equal check, 
the match is missed, although the only difference is the scheme (s3 vs. s3a) 
and the file is there.
   
   Hence my recommendation above.
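   
   To illustrate the concern, here is a minimal sketch of a scheme-insensitive 
comparison. It is plain Java, not the code in this PR: the class 
`PathSchemes`, its methods, and the s3a/s3n-to-s3 mapping are all hypothetical 
names and assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: treats a few URI schemes as equivalent
// before comparing file locations.
final class PathSchemes {
  // Assumption: s3a and s3n address the same objects as s3.
  private static final Map<String, String> EQUAL_SCHEMES = Map.of("s3a", "s3", "s3n", "s3");

  private PathSchemes() {}

  // Rewrites the scheme to its canonical form; locations without "://" are returned unchanged.
  static String normalize(String location) {
    int idx = location.indexOf("://");
    if (idx < 0) {
      return location;
    }
    String scheme = location.substring(0, idx).toLowerCase(Locale.ROOT);
    String canonical = EQUAL_SCHEMES.getOrDefault(scheme, scheme);
    return canonical + location.substring(idx);
  }

  // True when both locations are equal after scheme normalization.
  static boolean samePath(String left, String right) {
    return normalize(left).equals(normalize(right));
  }
}
```

   With this, 's3://<tbl_location>/filea.parquet' and 
's3a://<tbl_location>/filea.parquet' compare as equal, while an exact string 
comparison treats them as different files. In the Spark action, the same 
normalization would have to be applied to both columns before the comparison.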



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

