[GitHub] [iceberg] findepi commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

GitBox Wed, 18 Jan 2023 00:54:08 -0800


findepi commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1073253298



##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   > I am wondering why Theta sketch is preferred over HLL. Is it because Theta
   > sketch works better for larger data, and has better set intersection and
   > difference operation?
   
   Theta supports union, intersection, and difference; HLL supports only union.
   In my testing, the two were roughly similar in NDV quality for unions.
   See the results in
   https://github.com/trinodb/trino/pull/14290#issuecomment-1262160870
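
   To illustrate why a theta-style sketch can offer all three set operations, here is a minimal sketch of the idea. It is a simplified KMV-style structure written only for this comment, not the Apache DataSketches implementation or its API. The class name `MiniThetaSketch`, its methods, and the hash mixer are all hypothetical. Each sketch keeps the `k` smallest hashes below a threshold `theta`; because two sketches sample the same region of hash space once they share the smaller `theta`, their retained hashes can be unioned, intersected, or subtracted like ordinary sets. HLL keeps only per-register maxima, so merging registers (union) is the only natural operation there.

```java
import java.util.TreeSet;

/** Hypothetical, simplified KMV/theta-style sketch; NOT the Apache DataSketches API. */
final class MiniThetaSketch {
  private final int k;                          // nominal number of retained hashes
  private long theta = Long.MAX_VALUE;          // only hashes strictly below theta are kept
  private final TreeSet<Long> hashes = new TreeSet<>();

  MiniThetaSketch(int k) {
    this.k = k;
  }

  /** SplitMix64-style mixer (an assumption for this example), shifted to a non-negative long. */
  private static long hash(long x) {
    long z = x + 0x9e3779b97f4a7c15L;
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return (z ^ (z >>> 31)) >>> 1;
  }

  void update(long value) {
    insert(hash(value));
  }

  private void insert(long h) {
    if (h < theta && hashes.add(h) && hashes.size() > k) {
      theta = hashes.pollLast(); // evict the largest hash; it becomes the new threshold
    }
  }

  /** Retained count divided by the sampled fraction of the hash space. */
  double estimate() {
    if (theta == Long.MAX_VALUE) {
      return hashes.size(); // never overflowed, so the count is exact
    }
    return hashes.size() / (theta / 0x1p63);
  }

  static MiniThetaSketch union(MiniThetaSketch a, MiniThetaSketch b) {
    MiniThetaSketch r = new MiniThetaSketch(Math.min(a.k, b.k));
    r.theta = Math.min(a.theta, b.theta);
    for (long h : a.hashes) r.insert(h);
    for (long h : b.hashes) r.insert(h);
    return r;
  }

  static MiniThetaSketch intersection(MiniThetaSketch a, MiniThetaSketch b) {
    return filter(a, b, true);
  }

  /** Elements of a that are not in b. */
  static MiniThetaSketch difference(MiniThetaSketch a, MiniThetaSketch b) {
    return filter(a, b, false);
  }

  // Keeps hashes of a below the shared theta whose membership in b matches keepIfInB.
  private static MiniThetaSketch filter(MiniThetaSketch a, MiniThetaSketch b, boolean keepIfInB) {
    MiniThetaSketch r = new MiniThetaSketch(a.k);
    r.theta = Math.min(a.theta, b.theta);
    for (long h : a.hashes) {
      if (h < r.theta && b.hashes.contains(h) == keepIfInB) {
        r.hashes.add(h);
      }
    }
    return r;
  }
}
```

   While a sketch holds fewer than `k` distinct values it never overflows, so all three operations return exact counts; beyond that, the results are estimates whose error shrinks as `k` grows.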
   
   > I have a side question for puffin file: Is puffin file only used to store
   > table level stats?
   
   Currently, yes.
   This is how Puffin is **currently** integrated in the Iceberg spec.
   
   There is ongoing work on partition-level stats; see e.g.
   https://github.com/apache/iceberg/pull/1985
   https://github.com/apache/iceberg/issues/1832
   https://github.com/apache/iceberg/issues/1833
   
   > I am currently taking a look at the file level bloom filter and wondering
   > where I should save them
   
   This comment from @rdblue looks relevant:
   https://github.com/apache/iceberg/issues/1832#issuecomment-757072379



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


