findepi commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1072412666
##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 =
"apache-datasketches-theta-v1";
+
+ public static final String NDV_BLOB = "ndv-blob";
Review Comment:
> Spark doesn't use Apache DataSketches to collect approximate NDV
Same was true for Trino. Trino uses HLL by default.
I introduced DataSketches Theta aggregation so that we can be compatible.
For that I had to revamp the stats collection SPI so that connectors can
request the desired sketch format; previously it was hard-coded: if a
connector wanted NDV information, it got an HLL sketch.
For NDV information without an updateable sketch, we shouldn't use a blob at
all. The NDV number is just a property of the actual updateable sketch stored
in the Puffin file. For a POC, you can use a fake blob with empty content and
the associated NDV number as its property. Just give it a blob type name that
makes clear it's a fake, temporary format. For production use, we want to
write Theta sketches.
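To illustrate the point about the NDV being a property derived from the sketch rather than a standalone value: a minimal sketch of how a writer could build an updateable Theta sketch with the Apache DataSketches library, serialize it for a Puffin blob, and read the NDV estimate off it. This is an assumption-laden example, not the Iceberg implementation; the class and blob wiring here are purely illustrative.

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaNdvExample {
  public static void main(String[] args) {
    // Build an updateable Theta sketch and feed it column values.
    UpdateSketch sketch = UpdateSketch.builder().build();
    for (int i = 0; i < 1000; i++) {
      sketch.update("value-" + i);
    }

    // Compact the sketch for serialization; these bytes would become the
    // Puffin blob body (blob type apache-datasketches-theta-v1).
    CompactSketch compact = sketch.compact();
    byte[] blobBytes = compact.toByteArray();

    // The NDV number is derived from the sketch itself, which is why a
    // separate "ndv-blob" carrying only the number is redundant.
    double ndv = compact.getEstimate();
    System.out.println("blob size = " + blobBytes.length + ", estimated NDV = " + ndv);
  }
}
```

Because a Theta sketch stays updateable and mergeable, engines can union sketches across files to re-derive NDV, which a bare number blob cannot support.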
cc @rdblue
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]