Re: [PR] Spark: Use correct statistics file in SparkScan::estimateStatistics(Snapshot) [iceberg]

via GitHub Sat, 08 Mar 2025 12:37:47 -0800


wypoon commented on code in PR #12482:
URL: https://github.com/apache/iceberg/pull/12482#discussion_r1986150146



##########
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java:
##########
@@ -279,12 +268,50 @@ public void testTableWithOneColStats() throws 
NoSuchTableException {
 
     table.updateStatistics().setStatistics(statisticsFile).commit();
 
-    checkColStatisticsNotReported(scan, 4L);
-    withSQLConf(reportColStatsDisabled, () -> 
checkColStatisticsNotReported(scan, 4L));
+    List<SimpleRecord> newRecords =
+        Lists.newArrayList(new SimpleRecord(5, "a"), new SimpleRecord(6, "b"));
+    spark
+        .createDataset(newRecords, Encoders.bean(SimpleRecord.class))
+        .coalesce(1)
+        .writeTo(tableName)
+        .append();
 
-    Map<String, Long> expectedOneNDV = Maps.newHashMap();
-    expectedOneNDV.put("id", 4L);
-    withSQLConf(reportColStatsEnabled, () -> checkColStatisticsReported(scan, 
4L, expectedOneNDV));
+    table.refresh();
+    long snapshotId2 = table.currentSnapshot().snapshotId();
+
+    GenericStatisticsFile statisticsFile2 =
+        new GenericStatisticsFile(
+            snapshotId2,
+            "/test/statistics/file2.puffin",
+            100,
+            42,
+            ImmutableList.of(
+                new GenericBlobMetadata(
+                    APACHE_DATASKETCHES_THETA_V1,
+                    snapshotId2,
+                    2,
+                    ImmutableList.of(1),
+                    ImmutableMap.of("ndv", "6"))));
+
+    table.updateStatistics().setStatistics(statisticsFile2).commit();
+
+    SparkScanBuilder scanBuilder =
+        new SparkScanBuilder(spark, table, CaseInsensitiveStringMap.empty());
+    SparkScan scan = (SparkScan) scanBuilder.build();
+
+    Map<String, String> reportColStatsDisabled =
+        ImmutableMap.of(
+            SQLConf.CBO_ENABLED().key(), "true", 
SparkSQLProperties.REPORT_COLUMN_STATS, "false");
+
+    checkColStatisticsNotReported(scan, 6L);
+    withSQLConf(reportColStatsDisabled, () -> 
checkColStatisticsNotReported(scan, 6L));
+
+    Map<String, String> reportColStatsEnabled =
+        ImmutableMap.of(SQLConf.CBO_ENABLED().key(), "true");
+
+    Map<String, Long> expectedNDV = Maps.newHashMap();
+    expectedNDV.put("id", 6L);
+    withSQLConf(reportColStatsEnabled, () -> checkColStatisticsReported(scan, 
6L, expectedNDV));

Review Comment:
   The test is parameterized with three parameters. When run on its own without 
the fix, from one to three cases will fail. The reason some of the time the 
correct `StatisticsFile` appears as the first in the `List` is that when 
`TableMetadata` is built 
(https://github.com/apache/iceberg/blob/9a8466c95c323eba7c427cafb44a6c91a81d4bf4/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1602),
 the `List` is built from a `Map` and the order of the entries depend on the 
hashing of the snapshotId (which is random).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Use correct statistics file in SparkScan::estimateStatistics(Snapshot) [iceberg]

Reply via email to