[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

GitBox Tue, 17 Jan 2023 16:45:38 -0800


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1072976064



##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   @findepi @rdblue 
   
   I have a side question about Puffin files: are they only meant to store 
table-level stats? 
   
   I am currently looking into file-level bloom filters and wondering where I 
should store them. Right now we only support row-group bloom filters. If we 
add file-level bloom filter support, then at planning time, when the filter 
shows that a file cannot contain the value we are looking for, we can skip 
that file and never include it in a `FileScanTask`. I was originally thinking 
of adding the file-level bloom filters as blobs in a Puffin file. However, 
from the discussion we are having on NDV, my impression is that Puffin files 
are only intended for table-level stats. If that is true, do we have a place 
to put file-level stats that are large (e.g. bloom filters)?
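
To make the planning-time idea concrete, here is a minimal, self-contained sketch of pruning files with a per-file bloom filter. Everything in it is hypothetical: `FileBloomPruning`, `SimpleBloomFilter`, and `planFiles` are illustrative names, not Iceberg or Puffin APIs, and the hashing scheme is a simple stand-in rather than the one a real implementation would use. A file is kept whenever its filter reports a possible match or when no filter is available, so pruning can never drop a file that actually holds the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these types exist in Iceberg or Puffin.
class FileBloomPruning {

  /** Tiny bloom filter over strings; bit positions come from double hashing. */
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    /** Records a value by setting one bit per hash function. */
    void put(String value) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(value, i));
      }
    }

    /** Returns false only if the value was definitely never added. */
    boolean mightContain(String value) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(value, i))) {
          return false;
        }
      }
      return true;
    }

    // Combines two base hashes; floorMod keeps the index non-negative.
    private int index(String value, int i) {
      int h1 = value.hashCode();
      int h2 = fnv1a(value);
      return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String value) {
      int hash = 0x811C9DC5;
      for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
        hash ^= (b & 0xFF);
        hash *= 16777619;
      }
      return hash;
    }
  }

  /**
   * Returns the paths of files that may contain lookupValue.
   * A null filter means no stats are available, so that file is always kept.
   */
  static List<String> planFiles(Map<String, SimpleBloomFilter> filtersByFile, String lookupValue) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, SimpleBloomFilter> entry : filtersByFile.entrySet()) {
      SimpleBloomFilter filter = entry.getValue();
      if (filter == null || filter.mightContain(lookupValue)) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }
}
```

In this sketch the filters sit in an in-memory map keyed by file path; where they would actually be persisted (Puffin blobs or somewhere else) is exactly the open question above.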
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

