Re: [PR] Support Spark Column Stats [iceberg]

via GitHub Fri, 19 Jul 2024 01:59:30 -0700


findepi commented on code in PR #10659:
URL: https://github.com/apache/iceberg/pull/10659#discussion_r1683949890



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -175,7 +184,37 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Collections.emptyMap());
+    }
+
+    boolean cboEnabled =
+        Boolean.parseBoolean(spark.conf().get(SQLConf.CBO_ENABLED().key(), "false"));
+    Map<NamedReference, ColumnStatistics> colStatsMap = null;
+    if (readConf.enableColumnStats() && cboEnabled) {
+      colStatsMap = Maps.newHashMap();
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          int id = blobMetadata.fields().get(0);

Review Comment:
   Correct, the `apache-datasketches-theta-v1` blob should be calculated on
   one field.
   And yes, the `ndv` property should be set. The property may seem somewhat
   redundant within the Puffin file, but it allows faster access to the
   information at SELECT time. More importantly, the properties are propagated
   to the table metadata, so a query planner can access the NDV information
   without opening the Puffin file at all.
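
   To illustrate the point about reading NDV from metadata alone, here is a
   minimal, self-contained sketch. It does not use the Iceberg API: `Blob` is a
   hypothetical stand-in that mirrors only the `type()`, `fields()` and
   `properties()` parts of `BlobMetadata`, and the class and method names are
   made up for this example. It assumes the `ndv` property is a decimal string
   and skips any blob that is not a single-field theta sketch or has no usable
   `ndv` value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NdvFromBlobMetadata {
  static final String THETA_V1 = "apache-datasketches-theta-v1";
  static final String NDV_KEY = "ndv";

  // Hypothetical stand-in for Iceberg's BlobMetadata (type, fields, properties only).
  static final class Blob {
    final String type;
    final List<Integer> fields;
    final Map<String, String> properties;

    Blob(String type, List<Integer> fields, Map<String, String> properties) {
      this.type = type;
      this.fields = fields;
      this.properties = properties;
    }
  }

  // Collects NDV per field id from blob metadata, without reading any sketch bytes.
  static Map<Integer, Long> ndvByFieldId(List<Blob> blobs) {
    Map<Integer, Long> result = new HashMap<>();
    for (Blob blob : blobs) {
      // A theta sketch is expected to cover exactly one field; skip anything else.
      if (!THETA_V1.equals(blob.type) || blob.fields.size() != 1) {
        continue;
      }
      String ndv = blob.properties.get(NDV_KEY);
      if (ndv == null) {
        continue; // property missing: no NDV available from metadata
      }
      try {
        result.put(blob.fields.get(0), Long.parseLong(ndv));
      } catch (NumberFormatException e) {
        // Assumption: a malformed value is ignored rather than failing the scan.
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Blob> blobs =
        List.of(
            new Blob(THETA_V1, List.of(1), Map.of(NDV_KEY, "42")),
            new Blob(THETA_V1, List.of(2), Map.of()), // no ndv property
            new Blob("some-other-type", List.of(3), Map.of(NDV_KEY, "7")));
    System.out.println(ndvByFieldId(blobs));
  }
}
```

   In the PR code, the same checks would apply before calling
   `blobMetadata.fields().get(0)`: verify the blob type and that exactly one
   field is present, then read `ndv` from the properties.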



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

