ajantha-bhat commented on code in PR #13163:
URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111663422
##########
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java:
##########
@@ -336,16 +336,29 @@ private static PartitionMap<PartitionStats> computeStatsDiff(
         Sets.newHashSet(
             SnapshotUtil.ancestorIdsBetween(
                 toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot));
-    Predicate<ManifestFile> manifestFilePredicate =
-        manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId());
-    return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */);
+    return computeStats(table, toSnapshot, snapshotIdsRange);
   }
 
   private static PartitionMap<PartitionStats> computeStats(
-      Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) {
+      Table table, Snapshot snapshot, Set<Long> snapshotIdsRange) {
     StructType partitionType = Partitioning.partitionType(table);
-    List<ManifestFile> manifests =
-        snapshot.allManifests(table.io()).stream().filter(predicate).collect(Collectors.toList());
+    boolean incremental = !snapshotIdsRange.isEmpty();
+
+    List<ManifestFile> manifests;
+    if (incremental) {
+      // DELETED manifest entries are not carried over to subsequent snapshots.
+      // So, for incremental computation, gather the manifests added by each snapshot
+      // instead of relying solely on those from the latest snapshot.
+      manifests =
+          snapshotIdsRange.stream()
+              .flatMap(
+                  id ->
+                      table.snapshot(id).allManifests(table.io()).stream()
+                          .filter(file -> file.snapshotId().equals(id)))
+              .collect(Collectors.toList());

Review Comment:
Also note that, because of the snapshot ID filter, each snapshot's added manifest files are considered only once during the computation, so reused manifests are not counted again.

If manifests are rewritten, their entries are marked as EXISTING and are skipped for incremental computation by the existing logic in `collectStatsForManifest`.

So, IMO, this now works for all scenarios, and we have test cases covering them.
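To make the "considered only once" argument concrete, here is a minimal, dependency-free sketch of the filtering idea. It is not Iceberg code: `ManifestFilterSketch`, `Manifest`, `addedSnapshotId`, and `manifestsAddedInRange` are hypothetical stand-ins, and a plain map from snapshot ID to manifest list is assumed in place of `table.snapshot(id).allManifests(table.io())`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration only; these are NOT Iceberg classes.
class ManifestFilterSketch {

  /** Simplified manifest: a path plus the ID of the snapshot that added it. */
  static final class Manifest {
    final String path;
    final long addedSnapshotId;

    Manifest(String path, long addedSnapshotId) {
      this.path = path;
      this.addedSnapshotId = addedSnapshotId;
    }
  }

  /**
   * For each snapshot ID in the range, keep only the manifests that the
   * snapshot itself added. A manifest reused by later snapshots still carries
   * the ID of the snapshot that added it, so it is selected at most once.
   */
  static List<String> manifestsAddedInRange(
      Map<Long, List<Manifest>> allManifestsBySnapshot, Set<Long> snapshotIdsRange) {
    return snapshotIdsRange.stream()
        .flatMap(
            id ->
                allManifestsBySnapshot.get(id).stream()
                    .filter(manifest -> manifest.addedSnapshotId == id))
        .map(manifest -> manifest.path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Manifest m1 = new Manifest("m1.avro", 1L);
    Manifest m2 = new Manifest("m2.avro", 2L);
    Manifest m3 = new Manifest("m3.avro", 3L);

    // Snapshot 2 reuses m1; snapshot 3 reuses m1 and m2.
    Map<Long, List<Manifest>> bySnapshot = new LinkedHashMap<>();
    bySnapshot.put(1L, Arrays.asList(m1));
    bySnapshot.put(2L, Arrays.asList(m1, m2));
    bySnapshot.put(3L, Arrays.asList(m1, m2, m3));

    Set<Long> range = new LinkedHashSet<>(Arrays.asList(2L, 3L));

    // Prints [m2.avro, m3.avro]: m2 appears once even though snapshot 3 reuses it.
    System.out.println(manifestsAddedInRange(bySnapshot, range));
  }
}
```

In this toy model, `m1.avro` is excluded because snapshot 1 is outside the range, and `m2.avro` is picked up only under snapshot 2. The sketch does not model rewritten manifests; per the comment above, those are handled separately by the EXISTING-entry check in `collectStatsForManifest`.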