karuppayya commented on code in PR #10659:
URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674793152
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
```

Review Comment:
   Can there be more than one statistics file? For example, for a table with `col1` and `col2`, `statsFile1` could have stats for `col1` and `statsFile2` stats for `col2`. In that case we would need to go over all the files.

Review Comment:
   Looks like the [code](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1281) doesn't allow setting more than one statistics file. I think we should append to the list of stats files here?
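The suggestion in the comments above (walk every statistics file rather than only `files.get(0)`) can be sketched as follows. Note that `Blob` and `StatsFile` below are hypothetical stand-ins for Iceberg's `BlobMetadata` and `StatisticsFile`, used only to keep the sketch self-contained; this is an illustration of the merge logic, not the PR's actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Iceberg's BlobMetadata / StatisticsFile.
record Blob(int fieldId, Map<String, String> properties) {}
record StatsFile(List<Blob> blobs) {}

public class MergeStatsSketch {
  // Collect blob metadata from *all* statistics files, keyed by field id,
  // letting later files override earlier entries for the same field.
  static Map<Integer, Blob> mergeAll(List<StatsFile> files) {
    Map<Integer, Blob> byField = new HashMap<>();
    for (StatsFile file : files) {
      for (Blob blob : file.blobs()) {
        byField.put(blob.fieldId(), blob);
      }
    }
    return byField;
  }

  public static void main(String[] args) {
    // statsFile1 covers col1 (field id 1), statsFile2 covers col2 (field id 2).
    StatsFile f1 = new StatsFile(List.of(new Blob(1, Map.of("ndv", "100"))));
    StatsFile f2 = new StatsFile(List.of(new Blob(2, Map.of("ndv", "42"))));
    Map<Integer, Blob> merged = mergeAll(List.of(f1, f2));
    System.out.println(merged.size());
  }
}
```

With only `files.get(0)` the stats for field id 2 would be dropped; merging over the whole list covers both columns.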
cc: @findepi

########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          long ndv = Long.parseLong(blobMetadata.properties().get("ndv"));
```

Review Comment:
   Also check `org.apache.iceberg.BlobMetadata#type`?

########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java: ##########

```
@@ -347,4 +347,12 @@ private boolean executorCacheLocalityEnabledInternal() {
         .defaultValue(SparkSQLProperties.EXECUTOR_CACHE_LOCALITY_ENABLED_DEFAULT)
         .parse();
   }
+
+  public boolean enableColumnStats() {
```

Review Comment:
   @huaxingao Should we remove this flag? If statistics are available, we could always send them to Spark, since we are not computing them here; Spark can decide whether to use them based on whether CBO is enabled. Do we see a case where stats are available but we don't want to send them to Spark?
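The "also check `BlobMetadata#type`" point above can be sketched like this: only trust the `ndv` property when the blob type is the one expected to carry it, and guard the parse. The sketch assumes the Puffin theta-sketch blob type string and uses a plain `Map` in place of `BlobMetadata#properties()`; the `ndvFromBlob` helper is hypothetical, not part of the PR.

```java
import java.util.Map;
import java.util.OptionalLong;

public class NdvParseSketch {
  // Assumed Puffin blob type that carries an "ndv" property.
  static final String THETA_TYPE = "apache-datasketches-theta-v1";

  // Hypothetical helper: return the NDV only when the blob type matches
  // and the property parses cleanly; otherwise report "no estimate".
  static OptionalLong ndvFromBlob(String blobType, Map<String, String> properties) {
    if (!THETA_TYPE.equals(blobType)) {
      return OptionalLong.empty();
    }
    String ndv = properties.get("ndv");
    if (ndv == null) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(ndv));
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(ndvFromBlob(THETA_TYPE, Map.of("ndv", "128")));
    System.out.println(ndvFromBlob("some-other-blob", Map.of("ndv", "128")));
  }
}
```

Without the type check, `Long.parseLong(blobMetadata.properties().get("ndv"))` in the hunk above would NPE or throw `NumberFormatException` on blobs that do not carry an `ndv` property.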
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ##########

```
@@ -175,7 +181,25 @@ public Statistics estimateStatistics() {
   protected Statistics estimateStatistics(Snapshot snapshot) {
     // its a fresh table, no data
     if (snapshot == null) {
-      return new Stats(0L, 0L);
+      return new Stats(0L, 0L, Maps.newHashMap());
+    }
+
+    Map<NamedReference, ColumnStatistics> map = Maps.newHashMap();
+
+    if (readConf.enableColumnStats()) {
+      List<StatisticsFile> files = table.statisticsFiles();
+      if (!files.isEmpty()) {
+        List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
+
+        for (BlobMetadata blobMetadata : metadataList) {
+          long ndv = Long.parseLong(blobMetadata.properties().get("ndv"));
+          ColumnStatistics colStats = new ColStats(ndv, null, null, 0L, 0L, 0L, null);
+          int id = blobMetadata.fields().get(0);
+          String colName = table.schema().findColumnName(id);
+          NamedReference ref = FieldReference.column(colName);
```

Review Comment:
   `org.apache.iceberg.spark.Spark3Util#toNamedReference`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org