Re: [PR] Updating SparkScan to only read Apache DataSketches [iceberg]

via GitHub Wed, 28 Aug 2024 05:04:29 -0700


guykhazma commented on code in PR #11035:
URL: https://github.com/apache/iceberg/pull/11035#discussion_r1734554161



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -199,28 +199,24 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
 
         for (BlobMetadata blobMetadata : metadataList) {
-          int id = blobMetadata.fields().get(0);
-          String colName = table.schema().findColumnName(id);
-          NamedReference ref = FieldReference.column(colName);
-
-          Long ndv = null;
           if (blobMetadata
               .type()
               .equals(org.apache.iceberg.puffin.StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1)) {
+            int id = blobMetadata.fields().get(0);
+            String colName = table.schema().findColumnName(id);
+            NamedReference ref = FieldReference.column(colName);
+            Long ndv = null;
             String ndvStr = blobMetadata.properties().get(NDV_KEY);
             if (!Strings.isNullOrEmpty(ndvStr)) {
               ndv = Long.parseLong(ndvStr);
             } else {
               LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
             }
-          } else {
-            LOG.debug("DataSketch blob is not available for column {}", colName);
-          }
+            ColumnStatistics colStats =

Review Comment:
   Technically, we should first group the blob metadata by field, then extract all of the relevant metadata for each field and create a single SparkColumnStatistics instance per column.
   This is not specific to this PR, since the behaviour was the same before, but we might want to address it as well.
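
A minimal sketch of the grouping idea described above, not the actual SparkScan code. `Blob` is a hypothetical stand-in for Iceberg's `BlobMetadata`, the class and method names are invented for illustration, and the value `"ndv"` for `NDV_KEY` is an assumption. It groups blobs by their first field ID, then derives one NDV value per field from the recognized blob type only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BlobGroupingSketch {
  static final String THETA_V1 = "apache-datasketches-theta-v1";
  // Assumed property key; the diff only shows the constant name NDV_KEY.
  static final String NDV_KEY = "ndv";

  // Hypothetical stand-in for org.apache.iceberg.BlobMetadata.
  static final class Blob {
    final String type;
    final List<Integer> fields;
    final Map<String, String> properties;

    Blob(String type, List<Integer> fields, Map<String, String> properties) {
      this.type = type;
      this.fields = fields;
      this.properties = properties;
    }
  }

  // Step 1: collect every blob under the ID of the first field it covers.
  static Map<Integer, List<Blob>> groupByField(List<Blob> blobs) {
    Map<Integer, List<Blob>> grouped = new LinkedHashMap<>();
    for (Blob blob : blobs) {
      if (blob.fields.isEmpty()) {
        continue; // nothing to attribute this blob to
      }
      grouped.computeIfAbsent(blob.fields.get(0), id -> new ArrayList<>()).add(blob);
    }
    return grouped;
  }

  // Step 2: for each field, read the NDV from the theta sketch blob, if any.
  // Fields without a usable NDV are left out of the result.
  static Map<Integer, Long> ndvByField(List<Blob> blobs) {
    Map<Integer, Long> result = new LinkedHashMap<>();
    for (Map.Entry<Integer, List<Blob>> entry : groupByField(blobs).entrySet()) {
      Long ndv = null;
      for (Blob blob : entry.getValue()) {
        if (THETA_V1.equals(blob.type)) {
          String ndvStr = blob.properties.get(NDV_KEY);
          if (ndvStr != null && !ndvStr.isEmpty()) {
            ndv = Long.parseLong(ndvStr);
          }
        }
      }
      if (ndv != null) {
        result.put(entry.getKey(), ndv);
      }
    }
    return result;
  }
}
```

With this shape, the per-column statistics object would be built once per map entry instead of once per blob, so additional blob types for the same field could contribute to the same instance.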



##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String PRESTO_SUM_DATA_SIZE_BYTES_V1 = "presto-sum-data-size-bytes-v1";

Review Comment:
   We don't need to store the exact identifier used by Presto as part of Iceberg.
   We can define it in the test, or even use a dummy identifier to simulate the presence of additional, unsupported metadata.
   
   Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.
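
A rough sketch of the dummy-identifier approach suggested above. The class name, the helper method, and the string `test-unsupported-blob-v1` are all hypothetical and exist only for illustration; the point is that the unsupported type lives in the test rather than in `StandardBlobTypes`.

```java
import java.util.ArrayList;
import java.util.List;

final class UnsupportedBlobTypeSketch {
  // Mirrors the value of StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1 from the diff.
  static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

  // Hypothetical test-only identifier standing in for any blob type
  // the reader does not understand.
  static final String DUMMY_UNSUPPORTED_TYPE = "test-unsupported-blob-v1";

  // Keeps only the blob types the reader knows how to interpret.
  static List<String> supportedTypes(List<String> blobTypes) {
    List<String> supported = new ArrayList<>();
    for (String type : blobTypes) {
      if (APACHE_DATASKETCHES_THETA_V1.equals(type)) {
        supported.add(type);
      }
    }
    return supported;
  }
}
```

A test could then write a statistics file containing both types and check that only the theta sketch blob contributes to the reported column statistics.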



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

