findepi commented on code in PR #10659: URL: https://github.com/apache/iceberg/pull/10659#discussion_r1683949890
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -175,7 +184,37 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Collections.emptyMap());
+    }
+
+    boolean cboEnabled =
+        Boolean.parseBoolean(spark.conf().get(SQLConf.CBO_ENABLED().key(), "false"));
+    Map<NamedReference, ColumnStatistics> colStatsMap = null;
+    if (readConf.enableColumnStats() && cboEnabled) {
+      colStatsMap = Maps.newHashMap();
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          int id = blobMetadata.fields().get(0);

Review Comment:
   Correct, an `apache-datasketches-theta-v1` blob should be calculated over exactly one field.
   And yes, the `ndv` property should be set on it.
   The property may seem somewhat redundant within the Puffin file, but it allows faster access to the information at SELECT time. More importantly, the blob properties are propagated to the table metadata, so a query planner can access the NDV information without opening the Puffin file at all.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
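
To make the point about reading NDV from blob metadata concrete, here is a minimal, self-contained sketch of the lookup the review comment describes. It is not the PR's code: `NdvLookup`, `BlobMeta`, and `ndvByFieldId` are hypothetical names, and `BlobMeta` is only a stand-in carrying the three pieces of blob metadata discussed above (type, field ids, properties). The sketch assumes the NDV is stored as a decimal string under the `ndv` property and that a theta-sketch blob covers exactly one field; blobs that do not match those assumptions are skipped rather than guessed at.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Iceberg or the PR.
final class NdvLookup {
  static final String THETA_V1 = "apache-datasketches-theta-v1";
  static final String NDV_KEY = "ndv";

  // Stand-in for the blob metadata recorded in table metadata:
  // blob type, the field ids it was computed over, and its properties.
  static final class BlobMeta {
    final String type;
    final List<Integer> fields;
    final Map<String, String> properties;

    BlobMeta(String type, List<Integer> fields, Map<String, String> properties) {
      this.type = type;
      this.fields = fields;
      this.properties = properties;
    }
  }

  // Returns field id -> NDV using only metadata; no Puffin file is opened.
  static Map<Integer, Long> ndvByFieldId(List<BlobMeta> blobs) {
    Map<Integer, Long> result = new HashMap<>();
    for (BlobMeta blob : blobs) {
      // Only theta sketches computed over a single field are considered.
      if (!THETA_V1.equals(blob.type) || blob.fields.size() != 1) {
        continue;
      }
      String ndv = blob.properties.get(NDV_KEY);
      if (ndv == null) {
        continue; // property missing: no estimate for this column
      }
      try {
        result.put(blob.fields.get(0), Long.parseLong(ndv));
      } catch (NumberFormatException e) {
        // Malformed value: skip it instead of failing the whole scan.
      }
    }
    return result;
  }
}
```

Checking the blob type and the field count before calling `fields().get(0)` is the design choice worth noting here: it keeps the lookup from attributing a multi-field or non-NDV blob to the first column it happens to list.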