findepi commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1072412666
##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 =
"apache-datasketches-theta-v1";
+
+ public static final String NDV_BLOB = "ndv-blob";
Review Comment:
> Spark doesn't use Apache DataSketches to collect approximate NDV
Same was true for Trino. Trino uses HLL by default.
I introduced DataSketches Theta aggregation so that we can be compatible.
For that I had to revamp the stats collection SPI so that connectors can
request the desired sketch format; previously it was hard-coded: if a
connector wanted NDV information, it got an HLL sketch.
For NDV information without an updateable sketch, we shouldn't use a blob at
all. The NDV number is just a property of the actual updateable sketch stored
in the Puffin file. For a POC, you can use a fake blob with empty content and
the associated NDV number as its property. Just give it a blob type name that
makes clear it's a fake, temporary format. For production use, we want to
write Theta sketches.
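To illustrate the point about the NDV being a property derived from the sketch rather than a standalone value: a minimal sketch of how a writer could build an updateable Theta sketch with the Apache DataSketches library, serialize it for a Puffin blob, and read the NDV estimate off it. This is an assumption-laden example, not the Iceberg implementation; the class and blob wiring here are purely illustrative.

```java
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaNdvExample {
  public static void main(String[] args) {
    // Build an updateable Theta sketch and feed it column values.
    UpdateSketch sketch = UpdateSketch.builder().build();
    for (int i = 0; i < 1000; i++) {
      sketch.update("value-" + i);
    }

    // Compact the sketch for serialization; these bytes would become the
    // Puffin blob body (blob type apache-datasketches-theta-v1).
    CompactSketch compact = sketch.compact();
    byte[] blobBytes = compact.toByteArray();

    // The NDV number is derived from the sketch itself, which is why a
    // separate "ndv-blob" carrying only the number is redundant.
    double ndv = compact.getEstimate();
    System.out.println("blob size = " + blobBytes.length + ", estimated NDV = " + ndv);
  }
}
```

Because a Theta sketch stays updateable and mergeable, engines can union sketches across files to re-derive NDV, which a bare number blob cannot support.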
cc @rdblue
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]