Re: [PR] Core: Support incremental compute for partition stats [iceberg]

via GitHub Mon, 07 Apr 2025 04:41:02 -0700


gaborkaszab commented on code in PR #12629:
URL: https://github.com/apache/iceberg/pull/12629#discussion_r2031057738



##########
core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java:
##########
@@ -40,27 +44,48 @@ public class PartitionStatsUtil {
   private PartitionStatsUtil() {}
 
   /**
-   * Computes the partition stats for the given snapshot of the table.
+   * Fully computes the partition stats for the given snapshot of the table.
    *
    * @param table the table for which partition stats to be computed.
    * @param snapshot the snapshot for which partition stats is computed.
    * @return the collection of {@link PartitionStats}
    */
   public static Collection<PartitionStats> computeStats(Table table, Snapshot snapshot) {
-    Preconditions.checkArgument(table != null, "table cannot be null");
-    Preconditions.checkArgument(Partitioning.isPartitioned(table), "table must be partitioned");
-    Preconditions.checkArgument(snapshot != null, "snapshot cannot be null");
+    Preconditions.checkArgument(table != null, "Table cannot be null");
+    Preconditions.checkArgument(Partitioning.isPartitioned(table), "Table must be partitioned");
+    Preconditions.checkArgument(snapshot != null, "Current snapshot cannot be null");
 
-    StructType partitionType = Partitioning.partitionType(table);
-    List<ManifestFile> manifests = snapshot.allManifests(table.io());
-    Queue<PartitionMap<PartitionStats>> statsByManifest = Queues.newConcurrentLinkedQueue();
-    Tasks.foreach(manifests)
-        .stopOnFailure()
-        .throwFailureWhenFinished()
-        .executeWith(ThreadPools.getWorkerPool())
-        .run(manifest -> statsByManifest.add(collectStats(table, manifest, partitionType)));
+    return collectStats(table, snapshot, file -> true, false /* incremental */).values();
+  }
 
-    return mergeStats(statsByManifest, table.specs());
+  /**
+   * Incrementally computes the partition stats after the given snapshot to current snapshot.
+   *
+   * @param table the table for which partition stats to be computed.
+   * @param fromSnapshot the snapshot after which partition stats is computed (exclusive).
+   * @param currentSnapshot the snapshot till which partition stats is computed (inclusive).
+   * @return the {@link PartitionMap} of {@link PartitionStats}
+   */
+  public static PartitionMap<PartitionStats> computeStatsIncremental(

Review Comment:
   I think this name could be misleading. Users might expect this function to
return stats that are ready to use, but after checking the code my impression
is that it returns only the stats diff between two snapshots. To get usable
stats, callers still have to merge the result with the stats from
`fromSnapshot`.
   Option a): could we do that final merge inside this function instead of
leaving it to the caller?
   Option b): state clearly that callers get a stats diff that they have to
merge with the last known stats.
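
   To make the extra step concrete, here is a minimal sketch of the merge a
caller would otherwise have to write. It assumes the stats are additive per
partition. `StatsMergeSketch`, `Stats`, and `merge` are hypothetical stand-ins
for illustration, not the actual Iceberg `PartitionStats` or `PartitionMap`
classes.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsMergeSketch {

  /** Hypothetical stand-in for per-partition stats; NOT the Iceberg PartitionStats class. */
  public static final class Stats {
    public final long dataRecordCount;
    public final int dataFileCount;

    public Stats(long dataRecordCount, int dataFileCount) {
      this.dataRecordCount = dataRecordCount;
      this.dataFileCount = dataFileCount;
    }

    /** Adds the counters of another Stats to this one (assumes the counters are additive). */
    public Stats plus(Stats other) {
      return new Stats(
          dataRecordCount + other.dataRecordCount, dataFileCount + other.dataFileCount);
    }
  }

  /**
   * Merges a stats diff into the last known stats and returns a new map.
   * Keys are partition identifiers (plain strings here, for simplicity).
   */
  public static Map<String, Stats> merge(Map<String, Stats> base, Map<String, Stats> diff) {
    Map<String, Stats> result = new HashMap<>(base);
    // Partitions only present in the diff are added; existing ones are summed.
    diff.forEach((partition, delta) -> result.merge(partition, delta, Stats::plus));
    return result;
  }

  public static void main(String[] args) {
    Map<String, Stats> statsAtFromSnapshot = new HashMap<>();
    statsAtFromSnapshot.put("day=2025-04-01", new Stats(100L, 2));

    Map<String, Stats> diffSinceFromSnapshot = new HashMap<>();
    diffSinceFromSnapshot.put("day=2025-04-01", new Stats(50L, 1));
    diffSinceFromSnapshot.put("day=2025-04-02", new Stats(30L, 1));

    Map<String, Stats> merged = merge(statsAtFromSnapshot, diffSinceFromSnapshot);
    merged.forEach(
        (partition, stats) ->
            System.out.println(
                partition + " records=" + stats.dataRecordCount + " files=" + stats.dataFileCount));
  }
}
```

   With option a), this step would live inside the function. With option b),
every caller would need something equivalent to `merge` above.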



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

