aokolnychyi commented on code in PR #9454: URL: https://github.com/apache/iceberg/pull/9454#discussion_r1457922682
########## core/src/main/java/org/apache/iceberg/ManifestFilterManager.java: ########## @@ -289,13 +321,38 @@ private void invalidateFilteredCache() { cleanUncommitted(SnapshotProducer.EMPTY_SET); } + private void recordPartitionMinDataSequenceNumber(ManifestFile manifest) { Review Comment: The existing approach only looks at manifest metadata to understand the min data sequence number across all partitions. This is really cheap as we don't have to open manifests (which can be a really expensive operation). That leads to the problem if one partition is significantly behind, it prevents garbage collection of delete files in other partitions. We have solved that for position deletes via the `rewritePositionDeletes` action but it still remains open for equality deletes. I am not convinced opening these manifests during commits is a good idea. Can we explore the option of leveraging the partition stats spec added recently? We are still building an action to generate those stats but let's think through whether it can help us. One option can be to check if the partition stats file is present and use it populate the min data sequence numbers, opening just a single Parquet file vs potentially tons of manifests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org