aokolnychyi commented on code in PR #9454:
URL: https://github.com/apache/iceberg/pull/9454#discussion_r1457922682


##########
core/src/main/java/org/apache/iceberg/ManifestFilterManager.java:
##########
@@ -289,13 +321,38 @@ private void invalidateFilteredCache() {
     cleanUncommitted(SnapshotProducer.EMPTY_SET);
   }
 
+  private void recordPartitionMinDataSequenceNumber(ManifestFile manifest) {

Review Comment:
   The existing approach only looks at manifest metadata to understand the min 
data sequence number across all partitions. This is really cheap as we don't 
have to open manifests (which can be a really expensive operation). That leads 
to the problem if one partition is significantly behind, it prevents garbage 
collection of delete files in other partitions. We have solved that for 
position deletes via the `rewritePositionDeletes` action but it still remains 
open for equality deletes.
   
   I am not convinced opening these manifests during commits is a good idea. 
Can we explore the option of leveraging the partition stats spec added 
recently? We are still building an action to generate those stats but let's 
think through whether it can help us. One option can be to check if the 
partition stats file is present and use it populate the min data sequence 
numbers, opening just a single Parquet file vs potentially tons of manifests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to