[GitHub] [iceberg] huaxingao commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

GitBox Tue, 17 Jan 2023 12:54:51 -0800


huaxingao commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1072805278



##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
   * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   @findepi Thanks a lot for your comment. I am wondering why Theta sketch is 
preferred over HLL. Is it because Theta sketch works better for larger data 
and has better support for set intersection and difference operations?
   
   It would be ideal if Spark supported Theta sketch, but it may take a long 
time for that support to land. Until Theta sketch is available in Spark, if 
we can't put the table-level NDV in a Puffin file, is there a better place to 
store it?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

