Re: [PR] Core: Support incremental compute for partition stats [iceberg]

via GitHub Thu, 27 Mar 2025 04:25:40 -0700


ajantha-bhat commented on code in PR #12629:
URL: https://github.com/apache/iceberg/pull/12629#discussion_r2016302587



##########
core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java:
##########
@@ -45,22 +49,53 @@ private PartitionStatsUtil() {}
    * @param table the table for which partition stats to be computed.
    * @param snapshot the snapshot for which partition stats is computed.
    * @return the collection of {@link PartitionStats}
+   * @deprecated since 1.9.0, will be removed in 1.10.0; use {@link 
#computeStats(Table, Snapshot,
+   *     Snapshot)} instead.
    */
+  @Deprecated
   public static Collection<PartitionStats> computeStats(Table table, Snapshot 
snapshot) {
-    Preconditions.checkArgument(table != null, "table cannot be null");
-    Preconditions.checkArgument(Partitioning.isPartitioned(table), "table must 
be partitioned");
-    Preconditions.checkArgument(snapshot != null, "snapshot cannot be null");
+    return computeStats(table, null, snapshot).values();
+  }
 
-    StructType partitionType = Partitioning.partitionType(table);
-    List<ManifestFile> manifests = snapshot.allManifests(table.io());
-    Queue<PartitionMap<PartitionStats>> statsByManifest = 
Queues.newConcurrentLinkedQueue();
-    Tasks.foreach(manifests)
-        .stopOnFailure()
-        .throwFailureWhenFinished()
-        .executeWith(ThreadPools.getWorkerPool())
-        .run(manifest -> statsByManifest.add(collectStats(table, manifest, 
partitionType)));
+  /**
+   * Computes the partition stats incrementally after the given snapshot to 
current snapshot. If the
+   * given snapshot is null, computes the stats completely instead of 
incrementally.
+   *
+   * @param table the table for which partition stats to be computed.
+   * @param fromSnapshot the snapshot after which partition stats is computed 
(exclusive).
+   * @param currentSnapshot the snapshot till which partition stats is 
computed (inclusive).
+   * @return the {@link PartitionMap} of {@link PartitionStats}
+   */
+  public static PartitionMap<PartitionStats> computeStats(
+      Table table, Snapshot fromSnapshot, Snapshot currentSnapshot) {
+    Preconditions.checkArgument(table != null, "Table cannot be null");
+    Preconditions.checkArgument(Partitioning.isPartitioned(table), "Table must 
be partitioned");
+    Preconditions.checkArgument(currentSnapshot != null, "Current snapshot 
cannot be null");
 
-    return mergeStats(statsByManifest, table.specs());
+    Predicate<ManifestFile> manifestFilePredicate = file -> true;
+    if (fromSnapshot != null) {
+      Preconditions.checkArgument(currentSnapshot != fromSnapshot, "Both the 
snapshots are same");
+      Preconditions.checkArgument(
+          SnapshotUtil.isAncestorOf(table, currentSnapshot.snapshotId(), 
fromSnapshot.snapshotId()),
+          "Starting snapshot %s is not an ancestor of current snapshot %s",
+          fromSnapshot.snapshotId(),
+          currentSnapshot.snapshotId());
+      Set<Long> snapshotIdsRange =
+          Sets.newHashSet(
+              SnapshotUtil.ancestorIdsBetween(
+                  currentSnapshot.snapshotId(), fromSnapshot.snapshotId(), 
table::snapshot));
+      manifestFilePredicate =
+          manifestFile ->
+              snapshotIdsRange.contains(manifestFile.snapshotId())
+                  && !manifestFile.hasExistingFiles();

Review Comment:
   good point. 
   
   While computing incremental, I observed that it may become duplicate counts. 
So, I added. 
   I do have some gaps, I need to understand fully when and all we mark 
manifest entry as existing.
   Is there any scenario exist to consider "existing" entries or just "added" 
is enough? 
    
   There is another check down below, that considers both added and existing 
(added long back). 
   
https://github.com/apache/iceberg/blob/d54d81ecc1968eb0ec16a1f266e1d0b29e53b2ef/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java#L103
   
   I will update the code to just keep added entry and also add a testcase of 
rewrite data files to ensure stats are same after the rewrite. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Core: Support incremental compute for partition stats [iceberg]

Reply via email to