findepi commented on code in PR #6582: URL: https://github.com/apache/iceberg/pull/6582#discussion_r1085085232
########## core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java: ##########

@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
   * href="https://datasketches.apache.org/">Apache DataSketches</a> library */
  public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:

> Does Trino update the NDV sketch every time a write happens?

Not yet, but there is a WIP PR for that: https://github.com/trinodb/trino/pull/15441

> What if a table is written by both Trino and Spark? I believe the update from the Spark side will be missing in that case.

That's unfortunately true, but we hope this is only a temporary limitation. I would feel uncomfortable advising users not to use Spark just because it cannot update Iceberg stats properly.

> An asynchronous operation like this procedure.

You need this anyway, since not all writes will update stats. For example, it is quite hard to update NDV stats on a deletion (was this the _only_ occurrence of a value, or one of many?). Trino provides an ANALYZE statement to compute stats for a table; it currently computes a Theta sketch and writes it to the Iceberg table's Puffin stats file. Automatic stats updates will work well for append-only tables that do not undergo deletions or updates (or only a negligible amount of them).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
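The comment above describes Trino computing a Theta sketch for the NDV (number of distinct values) estimate and storing it as a Puffin blob. The real `apache-datasketches-theta-v1` blob is produced by the Apache DataSketches Java library; purely to illustrate the underlying idea (estimate distinct counts from the k smallest hash values, without keeping all values), here is a minimal K-Minimum-Values estimator in Python. The class name and parameters are invented for this sketch and are not part of Iceberg, Trino, or DataSketches:

```python
import hashlib


class KmvSketch:
    """Tiny K-Minimum-Values distinct-count estimator (illustrative only).

    Keeps the k smallest hash values, normalized into [0, 1). If n distinct
    values were hashed, the k-th smallest is expected near k / n, so
    n is estimated as (k - 1) / kth_min.
    """

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # sorted list of the k smallest normalized hashes

    def update(self, value):
        # Hash the value to a pseudo-uniform point in [0, 1).
        digest = hashlib.sha1(str(value).encode()).digest()
        x = int.from_bytes(digest[:8], "big") / 2**64
        if x not in self.mins:
            self.mins.append(x)
            self.mins.sort()
            del self.mins[self.k:]  # keep only the k smallest

    def estimate(self):
        if len(self.mins) < self.k:
            # Fewer than k distinct values seen: the count is exact.
            return float(len(self.mins))
        return (self.k - 1) / self.mins[-1]


# Feed 10,000 distinct values; the estimate lands within a few percent.
sketch = KmvSketch(k=256)
for i in range(10_000):
    sketch.update(f"user-{i}")
print(round(sketch.estimate()))
```

Note the deletion problem mentioned in the comment is visible here: removing a value from such a sketch is impossible in general, because the sketch cannot tell whether the value occurred once or many times.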