karuppayya commented on code in PR #10659:
URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674793152
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
```

Review Comment:
   Can there be more than one statistics file? For example, for a table with `col1` and `col2`, `statsFile1` could have stats for `col1` and `statsFile2` stats for `col2`. In that case we would need to go over all the files.

Review Comment:
   Looks like the [code](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1281) doesn't allow setting more than one statistics file. I think we should append to the list of stats files here?
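The suggestion in the comments above (walk every statistics file rather than only `files.get(0)`) can be sketched as follows. Note that `Blob` and `StatsFile` below are hypothetical stand-ins for Iceberg's `BlobMetadata` and `StatisticsFile`, used only to keep the sketch self-contained; this is an illustration of the merge logic, not the PR's actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Iceberg's BlobMetadata / StatisticsFile.
record Blob(int fieldId, Map<String, String> properties) {}
record StatsFile(List<Blob> blobs) {}

public class MergeStatsSketch {
  // Collect blob metadata from *all* statistics files, keyed by field id,
  // letting later files override earlier entries for the same field.
  static Map<Integer, Blob> mergeAll(List<StatsFile> files) {
    Map<Integer, Blob> byField = new HashMap<>();
    for (StatsFile file : files) {
      for (Blob blob : file.blobs()) {
        byField.put(blob.fieldId(), blob);
      }
    }
    return byField;
  }

  public static void main(String[] args) {
    // statsFile1 covers col1 (field id 1), statsFile2 covers col2 (field id 2).
    StatsFile f1 = new StatsFile(List.of(new Blob(1, Map.of("ndv", "100"))));
    StatsFile f2 = new StatsFile(List.of(new Blob(2, Map.of("ndv", "42"))));
    Map<Integer, Blob> merged = mergeAll(List.of(f1, f2));
    System.out.println(merged.size());
  }
}
```

With only `files.get(0)` the stats for field id 2 would be dropped; merging over the whole list covers both columns.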
cc: @findepi

########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          long ndv = Long.parseLong(blobMetadata.properties().get("ndv"));
```

Review Comment:
   Also check `org.apache.iceberg.BlobMetadata#type`?

########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java: ##########

```
@@ -347,4 +347,12 @@ private boolean executorCacheLocalityEnabledInternal() {
         .defaultValue(SparkSQLProperties.EXECUTOR_CACHE_LOCALITY_ENABLED_DEFAULT)
         .parse();
   }
+
+  public boolean enableColumnStats() {
```

Review Comment:
   @huaxingao Should we remove this flag? If statistics are available, we could always send them to Spark, since we are not computing them here; Spark can decide whether to use them based on whether CBO is enabled. Do we see a case where stats are available but we don't want to send them to Spark?
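The "also check `BlobMetadata#type`" point above can be sketched like this: only trust the `ndv` property when the blob type is the one expected to carry it, and guard the parse. The sketch assumes the Puffin theta-sketch blob type string and uses a plain `Map` in place of `BlobMetadata#properties()`; the `ndvFromBlob` helper is hypothetical, not part of the PR.

```java
import java.util.Map;
import java.util.OptionalLong;

public class NdvParseSketch {
  // Assumed Puffin blob type that carries an "ndv" property.
  static final String THETA_TYPE = "apache-datasketches-theta-v1";

  // Hypothetical helper: return the NDV only when the blob type matches
  // and the property parses cleanly; otherwise report "no estimate".
  static OptionalLong ndvFromBlob(String blobType, Map<String, String> properties) {
    if (!THETA_TYPE.equals(blobType)) {
      return OptionalLong.empty();
    }
    String ndv = properties.get("ndv");
    if (ndv == null) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(ndv));
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(ndvFromBlob(THETA_TYPE, Map.of("ndv", "128")));
    System.out.println(ndvFromBlob("some-other-blob", Map.of("ndv", "128")));
  }
}
```

Without the type check, `Long.parseLong(blobMetadata.properties().get("ndv"))` in the hunk above would NPE or throw `NumberFormatException` on blobs that do not carry an `ndv` property.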
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          long ndv = Long.parseLong(blobMetadata.properties().get("ndv"));
+          ColumnStatistics colStats = new ColStats(ndv, null, null, 0L, 0L, 0L, null);
+          int id = blobMetadata.fields().get(0);
+          String colName = table.schema().findColumnName(id);
+          NamedReference ref = FieldReference.column(colName);
```

Review Comment:
   `org.apache.iceberg.spark.Spark3Util#toNamedReference`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org