Re: [PR] Use Snapshot's statistics file in SparkScan [iceberg]

via GitHub Fri, 20 Sep 2024 12:35:49 -0700


amogh-jahagirdar commented on code in PR #11040:
URL: https://github.com/apache/iceberg/pull/11040#discussion_r1769162105



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -194,9 +195,9 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
     Map<NamedReference, ColumnStatistics> colStatsMap = Collections.emptyMap();
     if (readConf.reportColumnStats() && cboEnabled) {
       colStatsMap = Maps.newHashMap();
-      List<StatisticsFile> files = table.statisticsFiles();
-      if (!files.isEmpty()) {
-        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+      Optional<StatisticsFile> statisticsFile = statisticsFile(snapshot);
+      if (statisticsFile.isPresent()) {
+        List<BlobMetadata> metadataList = statisticsFile.get().blobMetadata();

Review Comment:
   @karuppayya Sorry for missing this earlier. I think we may want to consider 
a table API for resolving a statistics file based on a snapshot, 
`statisticsFileFor`. The implementation of that API could just do a best-effort 
search for the statistics file of a given snapshot and, if one cannot be found, 
just return the most recent one. 
   
   If an engine integration needs the exact statistics and the API response 
isn't it, that's OK since the engine can then just ignore the statistics file. 
But I think in the most common cases, having an out-of-date statistics file is 
probably acceptable, so the API should probably default to the best-effort 
lookup.
   
   This is analogous to what happens in the `view.dialectFor` API, where a 
best-effort match for a given dialect is searched for, but if one cannot be 
found the first representation is returned. Engines like Trino, which require 
the strict dialect, can compare the API response against the desired dialect 
and fail accordingly. Other engines like Spark don't do the strict lookup and 
just take the response as is.
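   
   To make the proposal concrete, here is a rough sketch of what the 
best-effort lookup and a strict caller could look like. Everything in it is 
hypothetical: `StatsFile` is a minimal stand-in rather than the real 
`StatisticsFile` interface, the method takes the list of files directly instead 
of living on the table, and it assumes the list is ordered oldest to newest so 
that the last entry is "the most recent one".

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; these names are not the actual Iceberg API.
final class StatisticsLookupSketch {

  // Minimal stand-in for a statistics file: just the fields this sketch needs.
  static final class StatsFile {
    private final long snapshotId;
    private final String path;

    StatsFile(long snapshotId, String path) {
      this.snapshotId = snapshotId;
      this.path = path;
    }

    long snapshotId() {
      return snapshotId;
    }

    String path() {
      return path;
    }
  }

  // Best-effort lookup: prefer the file written for the given snapshot,
  // otherwise fall back to the most recent file.
  // Assumption: `files` is ordered oldest to newest.
  static Optional<StatsFile> statisticsFileFor(List<StatsFile> files, long snapshotId) {
    if (files.isEmpty()) {
      return Optional.empty();
    }
    for (StatsFile file : files) {
      if (file.snapshotId() == snapshotId) {
        return Optional.of(file);
      }
    }
    return Optional.of(files.get(files.size() - 1));
  }

  // What a strict engine could do: take the best-effort response and
  // ignore it when it does not belong to the requested snapshot.
  static Optional<StatsFile> exactStatisticsFileFor(List<StatsFile> files, long snapshotId) {
    return statisticsFileFor(files, snapshotId)
        .filter(file -> file.snapshotId() == snapshotId);
  }
}
```

   With files for snapshots 1 and 2, a lookup for snapshot 3 would return the 
file for snapshot 2 from the best-effort method and an empty result from the 
strict one, which mirrors how engines can choose between taking the response as 
is and checking it.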



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


