singhpk234 commented on code in PR #12270:
URL: https://github.com/apache/iceberg/pull/12270#discussion_r1961054056
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RemoveDanglingDeletesSparkAction.java:
##########
@@ -156,7 +162,12 @@ private List<DeleteFile> findDanglingDeletes() {
.or(
col("data_file.content")
.equalTo("2")
- .and(col("sequence_number").$less$eq(col("min_data_sequence_number"))));
+ .and(col("sequence_number").$less$eq(col("min_data_sequence_number"))))
+ // dvs pointing to non-existing data files
+ .or(
+ col("data_file.file_format")
+ .equalTo(FileFormat.PUFFIN.name())
Review Comment:
Apologies for the confusion, this comment was meant for the line below,
essentially where we match the data file path against the file path the
Puffin file points to.
Can an exact equality check lead to a miss? For example, suppose the table
has file_path 's3://<tbl_location>/filea.parquet' but the Puffin files point
to 's3a://<tbl_location>/filea.parquet'. Since we do an exact not-equal
check, the match can be missed: the only difference is the scheme (s3 vs.
s3a), yet the file is there.
Hence I was recommending the above.
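
To make the concern concrete, here is a minimal sketch of scheme-insensitive
path comparison. `PathSchemeNormalizer`, `normalize`, and `samePath` are
hypothetical names used for illustration only; they are not part of Iceberg
or of this PR, and the sketch assumes that s3, s3a, and s3n all address the
same object store.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an Iceberg API.
final class PathSchemeNormalizer {
  // Assumption: these schemes are interchangeable names for the same S3 store.
  private static final Set<String> S3_SCHEMES = Set.of("s3", "s3a", "s3n");

  private PathSchemeNormalizer() {}

  // Rewrites an s3a:// or s3n:// prefix to s3://; any other path is returned unchanged.
  static String normalize(String path) {
    int idx = path.indexOf("://");
    if (idx <= 0) {
      return path; // no scheme present, nothing to normalize
    }
    String scheme = path.substring(0, idx).toLowerCase(Locale.ROOT);
    return S3_SCHEMES.contains(scheme) ? "s3" + path.substring(idx) : path;
  }

  // Compares two locations after normalizing their schemes.
  static boolean samePath(String left, String right) {
    return normalize(left).equals(normalize(right));
  }
}
```

With this, `samePath("s3://bucket/tbl/filea.parquet", "s3a://bucket/tbl/filea.parquet")`
is true, whereas a plain `equals` on the two strings is false. In the Spark
action, the same idea would have to be applied to both path columns before
they are compared.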
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]