lirui-apache commented on code in PR #13163:
URL: https://github.com/apache/iceberg/pull/13163#discussion_r2110839769
########## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ##########

@@ -336,16 +336,22 @@ private static PartitionMap<PartitionStats> computeStatsDiff(
         Sets.newHashSet(
             SnapshotUtil.ancestorIdsBetween(
                 toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot));
-    Predicate<ManifestFile> manifestFilePredicate =
-        manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId());
-    return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */);
+    return computeStats(table, toSnapshot, snapshotIdsRange, true /* incremental */);
   }

   private static PartitionMap<PartitionStats> computeStats(
-      Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) {
+      Table table, Snapshot snapshot, Set<Long> snapshotIdsRange, boolean incremental) {
     StructType partitionType = Partitioning.partitionType(table);
-    List<ManifestFile> manifests =
-        snapshot.allManifests(table.io()).stream().filter(predicate).collect(Collectors.toList());
+
+    List<ManifestFile> manifests;
+    if (incremental) {

Review Comment:
   I think we can tell whether we're doing an incremental compute by checking whether `snapshotIdsRange` is empty?
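The suggestion above can be illustrated with a small, self-contained sketch (hypothetical class and method names, not code from the PR): if a full compute passes an empty snapshot-id range and an incremental compute passes the ancestor ids between the two snapshots, the separate `incremental` flag can be derived rather than threaded through as a parameter.

```java
import java.util.Collections;
import java.util.Set;

public class IncrementalCheck {
  // Hypothetical helper: a full compute passes an empty snapshot-id range,
  // so "incremental" can be inferred instead of passed as a boolean flag.
  static boolean isIncremental(Set<Long> snapshotIdsRange) {
    return !snapshotIdsRange.isEmpty();
  }

  public static void main(String[] args) {
    // Full compute: no ancestor snapshot ids to restrict manifests to.
    System.out.println(isIncremental(Collections.emptySet())); // prints "false"
    // Incremental compute: ids between fromSnapshot and toSnapshot.
    System.out.println(isIncremental(Set.of(101L, 102L)));     // prints "true"
  }
}
```

This removes one degree of freedom from the method signature, so callers cannot pass an inconsistent flag/range combination.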
########## core/src/test/java/org/apache/iceberg/PartitionStatsHandlerTestBase.java: ##########

@@ -441,6 +441,52 @@ public void testPartitionStats() throws Exception {
         snapshot1.snapshotId()));
   }

+  @Test
+  public void testCopyOnWriteDelete() throws Exception {
+    Table testTable =
+        TestTables.create(tempDir("my_test"), "my_test", SCHEMA, SPEC, 2, fileFormatProperty);
+
+    DataFile dataFile1 =
+        DataFiles.builder(SPEC)
+            .withPath("/df1.parquet")
+            .withPartitionPath("c2=a/c3=a")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+    DataFile dataFile2 =
+        DataFiles.builder(SPEC)
+            .withPath("/df2.parquet")
+            .withPartitionPath("c2=b/c3=b")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1)
+            .build();
+
+    testTable.newAppend().appendFile(dataFile1).appendFile(dataFile2).commit();
+
+    PartitionStatisticsFile statisticsFile =
+        PartitionStatsHandler.computeAndWriteStatsFile(testTable);
+    testTable.updatePartitionStatistics().setPartitionStatistics(statisticsFile).commit();
+
+    assertThat(
+            PartitionStatsHandler.readPartitionStatsFile(
+                PartitionStatsHandler.schema(Partitioning.partitionType(testTable)),
+                Files.localInput(statisticsFile.path())))
+        .allMatch(s -> (s.dataRecordCount() != 0 && s.dataFileCount() != 0));
+
+    testTable.newDelete().deleteFile(dataFile1).commit();
+    testTable.newDelete().deleteFile(dataFile2).commit();
+
+    PartitionStatisticsFile statisticsFileNew =
+        PartitionStatsHandler.computeAndWriteStatsFile(testTable);
+
+    // stats must be decremented to zero as all the files removed from table.
+    assertThat(
+            PartitionStatsHandler.readPartitionStatsFile(
+                PartitionStatsHandler.schema(Partitioning.partitionType(testTable)),
+                Files.localInput(statisticsFileNew.path())))

Review Comment:
   BTW, I noticed `Files.localInput` is also used in production code, [PartitionStatsHandler::computeAndMergeStatsIncremental](https://github.com/apache/iceberg/blob/cab0decbb0e32bf314039e30807eb033c50665d5/core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java#L279), to load the previous stats file. Shouldn't we use the table's FileIO to do that?
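The comment above suggests swapping `Files.localInput` for the table's `FileIO`. A rough sketch of what that substitution might look like, assuming Iceberg's `Table.io()` and `FileIO.newInputFile(String)` APIs (this is illustrative, not the actual patch):

```java
// Instead of resolving the previous stats file from the local filesystem:
//   InputFile input = Files.localInput(statisticsFile.path());
// go through the table's configured FileIO, which also resolves paths on
// remote storage (S3, HDFS, ...):
InputFile input = table.io().newInputFile(statisticsFile.path());
```

This would keep the stats-file read consistent with how the handler accesses manifests elsewhere (e.g. `snapshot.allManifests(table.io())`).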