Re: [PR] Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot) [iceberg]

via GitHub Thu, 27 Mar 2025 10:47:39 -0700


wypoon commented on code in PR #12647:
URL: https://github.com/apache/iceberg/pull/12647#discussion_r2017293425



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -195,8 +196,10 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     if (readConf.reportColumnStats() && cboEnabled) {
       colStatsMap = Maps.newHashMap();
       List<StatisticsFile> files = table.statisticsFiles();
-      if (!files.isEmpty()) {
-        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+      Optional<StatisticsFile> file =
+          files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst();

Review Comment:
   The spec doesn't actually say that there should be only *one* statistics 
file per snapshot; that just happens to be how the Java implementation behaves. 
The spec simply allows for multiple statistics files.
   I was thinking about the problem of tracking orphaned statistics files when 
statistics are recomputed. One idea I had was to keep the replaced statistics 
files for a snapshot in the list (as long as the files are tracked in metadata, 
we can clean up the unused ones), with the newest one ordered before the 
others. Hence `findFirst`. It was just an idea (and honestly not one I'm 
seriously considering).
   In any case, I do not think that `findAny` is faster than `findFirst` here.
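
   To make the ordering idea concrete, here is a minimal, self-contained 
sketch. `StatsFile` is a hypothetical stand-in for Iceberg's `StatisticsFile` 
(only a snapshot ID and a path), and the file names are invented. It assumes 
the list keeps the newest file for a snapshot ahead of the ones it replaced, 
which is only the idea floated above, not current behavior. On a sequential 
stream over a `List`, `findFirst` returns the first match in list order, and 
both `findFirst` and `findAny` stop at the first match they see, so I would 
not expect a measurable difference between them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class StatsFileLookup {

  // Hypothetical stand-in for org.apache.iceberg.StatisticsFile;
  // it models only the two fields this sketch needs.
  static final class StatsFile {
    private final long snapshotId;
    private final String path;

    StatsFile(long snapshotId, String path) {
      this.snapshotId = snapshotId;
      this.path = path;
    }

    long snapshotId() {
      return snapshotId;
    }

    String path() {
      return path;
    }
  }

  // Returns the first statistics file in list order whose snapshot ID matches.
  // If the newest file is listed first (an assumption), this picks the newest.
  static Optional<StatsFile> forSnapshot(List<StatsFile> files, long snapshotId) {
    return files.stream().filter(f -> f.snapshotId() == snapshotId).findFirst();
  }

  public static void main(String[] args) {
    List<StatsFile> files =
        Arrays.asList(
            new StatsFile(1L, "stats-1.puffin"),
            new StatsFile(2L, "stats-2-recomputed.puffin"), // newest for snapshot 2
            new StatsFile(2L, "stats-2-original.puffin")); // replaced, still tracked

    // Prints the recomputed file, because it precedes the replaced one.
    System.out.println(forSnapshot(files, 2L).map(StatsFile::path).orElse("none"));
    // Prints "none": there is no statistics file for snapshot 3.
    System.out.println(forSnapshot(files, 3L).map(StatsFile::path).orElse("none"));
  }
}
```

   The `Optional` result also covers the case the original `files.get(0)` code 
missed: a snapshot with no matching statistics file yields an empty result 
instead of another snapshot's statistics.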



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


