wypoon commented on code in PR #12647: URL: https://github.com/apache/iceberg/pull/12647#discussion_r2017293425
########## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ########## @@ -195,8 +196,10 @@ protected Statistics estimateStatistics(Snapshot snapshot) { if (readConf.reportColumnStats() && cboEnabled) { colStatsMap = Maps.newHashMap(); List<StatisticsFile> files = table.statisticsFiles(); - if (!files.isEmpty()) { - List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); + Optional<StatisticsFile> file = + files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst(); Review Comment: The spec doesn't actually say that there should only be *one* statistics file per snapshot. This happens to be how it is implemented in Java. The spec simply allows for multiple statistics files. I was thinking about the problem of tracking orphaned statistics files when they are recomputed. One idea I had was to keep replaced statistics files (for a snapshot) still in the list (as long as the files are tracked in metadata we can clean up unused ones), but to keep the newest one before others. Hence `findFirst`. It was just an idea (and honestly not one I'm seriously considering). In any case, I do not think that `findAny` is faster than `findFirst` here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org