guykhazma commented on code in PR #11035:
URL: https://github.com/apache/iceberg/pull/11035#discussion_r1734554161
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java:
##########
@@ -199,28 +199,24 @@ protected Statistics estimateStatistics(Snapshot snapshot) {
         List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();
 
         for (BlobMetadata blobMetadata : metadataList) {
-          int id = blobMetadata.fields().get(0);
-          String colName = table.schema().findColumnName(id);
-          NamedReference ref = FieldReference.column(colName);
-
-          Long ndv = null;
           if (blobMetadata
               .type()
               .equals(org.apache.iceberg.puffin.StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1)) {
+            int id = blobMetadata.fields().get(0);
+            String colName = table.schema().findColumnName(id);
+            NamedReference ref = FieldReference.column(colName);
+            Long ndv = null;
             String ndvStr = blobMetadata.properties().get(NDV_KEY);
             if (!Strings.isNullOrEmpty(ndvStr)) {
               ndv = Long.parseLong(ndvStr);
             } else {
               LOG.debug("ndv is not set in BlobMetadata for column {}", colName);
             }
-          } else {
-            LOG.debug("DataSketch blob is not available for column {}", colName);
-          }
+            ColumnStatistics colStats =

Review Comment:
   Technically, we should group the metadata by field first, then extract all of the relevant metadata and create the SparkColumnStatistics instance for the column.
   This is not specific to this PR, since it was already the behaviour before, but we might want to address it as well.

##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String PRESTO_SUM_DATA_SIZE_BYTES_V1 = "presto-sum-data-size-bytes-v1";

Review Comment:
   We don't need to store the exact parameter used by Presto as part of Iceberg. We can use it in the test, or even use a dummy identifier, to simulate the existence of additional unsupported metadata.
   Separately, we should reach agreement on the right way to store the data size in the Puffin file across engines.
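
To illustrate the first comment (grouping the metadata by field before building the column statistics), here is a rough, self-contained sketch. It is not the Iceberg implementation: `Blob` is a hypothetical stand-in for `BlobMetadata`, the `"ndv"` property key is assumed to match `NDV_KEY` in `SparkScan`, and the method names are made up for the example. In `SparkScan`, each entry of the resulting map would presumably become one `SparkColumnStatistics` instance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlobGrouping {
  static final String THETA_V1 = "apache-datasketches-theta-v1";
  // Assumption: same property key as NDV_KEY in SparkScan.
  static final String NDV_KEY = "ndv";

  // Hypothetical stand-in for BlobMetadata, reduced to what the sketch needs.
  static final class Blob {
    final String type;
    final List<Integer> fields;
    final Map<String, String> properties;

    Blob(String type, List<Integer> fields, Map<String, String> properties) {
      this.type = type;
      this.fields = fields;
      this.properties = properties;
    }
  }

  // Step 1: group all blobs by the first field id they refer to.
  static Map<Integer, List<Blob>> groupByField(List<Blob> blobs) {
    Map<Integer, List<Blob>> byField = new LinkedHashMap<>();
    for (Blob blob : blobs) {
      if (blob.fields.isEmpty()) {
        continue; // nothing to attach this blob to
      }
      byField.computeIfAbsent(blob.fields.get(0), id -> new ArrayList<>()).add(blob);
    }
    return byField;
  }

  // Step 2: per field, pull out the supported metadata (only NDV here).
  // Fields that have no theta sketch blob are left out of the result;
  // a field with a theta blob but no ndv property maps to null.
  static Map<Integer, Long> ndvByField(List<Blob> blobs) {
    Map<Integer, Long> result = new LinkedHashMap<>();
    for (Map.Entry<Integer, List<Blob>> entry : groupByField(blobs).entrySet()) {
      for (Blob blob : entry.getValue()) {
        if (!THETA_V1.equals(blob.type)) {
          continue; // unsupported blob types are ignored
        }
        String ndvStr = blob.properties.get(NDV_KEY);
        Long ndv = (ndvStr == null || ndvStr.isEmpty()) ? null : Long.valueOf(ndvStr);
        result.put(entry.getKey(), ndv);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Blob> blobs =
        List.of(
            new Blob(THETA_V1, List.of(1), Map.of(NDV_KEY, "42")),
            new Blob("dummy-unsupported-type-v1", List.of(1), Map.of()),
            new Blob("dummy-unsupported-type-v1", List.of(2), Map.of()));
    System.out.println(ndvByField(blobs));
  }
}
```

With this shape, adding another supported blob type later would only touch the inner loop, and a column would still end up with a single statistics object no matter how many blobs refer to it.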