findepi opened a new issue, #6442: URL: https://github.com/apache/iceberg/issues/6442
### Feature Request / Improvement

Currently `UpdateStatistics` (`org.apache.iceberg.Transaction#updateStatistics`) only allows adding statistics for an existing snapshot. As a result, it is currently not possible to publish a snapshot with statistics already collected.

Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE), but some query engines (like Trino) can collect stats on the fly when writing to a table (INSERT, CREATE TABLE AS ...).

It's not difficult to

- publish the data change snapshot (adding new files)
- take note of the new snapshot ID
- add statistics for that snapshot

however, this has some drawbacks:

- new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if the data changed significantly)
- someone may run ANALYZE on the new snapshot (unknowingly or intentionally), and this will end up with two different threads wanting to add stats to it, which is wasted work

We should make it possible to publish a data change together with new stats. This may require API changes. It may also require spec changes, if we want to use the "inherit snapshot ID" model. (Maybe we don't have to, since stats are in metadata?)

### Query engine

None
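
To make the first drawback concrete, here is a minimal sketch of the two workflows. It is a toy in-memory model, not the Iceberg API: `ToyTable`, `commit_append`, `set_statistics`, and both helper functions are hypothetical names made up for illustration. It only shows that with two separate commits there is a window in which the current snapshot has no stats, while a single combined commit has no such window.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTable:
    """Hypothetical stand-in for table metadata; NOT the Iceberg API."""
    current_snapshot_id: Optional[int] = None
    snapshots: Dict[int, List[str]] = field(default_factory=dict)   # snapshot ID -> data files
    statistics: Dict[int, str] = field(default_factory=dict)        # snapshot ID -> stats file
    next_id: int = 1

    def commit_append(self, data_files: List[str], stats_file: Optional[str] = None) -> int:
        """Publish a new snapshot, optionally attaching stats in the same commit."""
        snapshot_id = self.next_id
        self.next_id += 1
        self.snapshots[snapshot_id] = list(data_files)
        if stats_file is not None:
            self.statistics[snapshot_id] = stats_file
        self.current_snapshot_id = snapshot_id
        return snapshot_id

    def set_statistics(self, snapshot_id: int, stats_file: str) -> None:
        """Attach stats to a snapshot that has already been published."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.statistics[snapshot_id] = stats_file

    def current_has_stats(self) -> bool:
        return self.current_snapshot_id in self.statistics


def write_then_add_stats(table: ToyTable, data_files: List[str], stats_file: str) -> bool:
    """Two commits. Returns whether a reader planning between them would see stats."""
    snapshot_id = table.commit_append(data_files)   # publish the data change
    seen_by_reader = table.current_has_stats()      # a concurrent reader plans here
    table.set_statistics(snapshot_id, stats_file)   # add stats for that snapshot
    return seen_by_reader


def write_with_stats(table: ToyTable, data_files: List[str], stats_file: str) -> bool:
    """One commit carrying both data and stats. Returns whether a reader sees stats."""
    table.commit_append(data_files, stats_file=stats_file)
    return table.current_has_stats()


if __name__ == "__main__":
    print(write_then_add_stats(ToyTable(), ["a.parquet"], "a.stats"))
    print(write_with_stats(ToyTable(), ["a.parquet"], "a.stats"))
```

In the two-commit variant, anything that reads the table between `commit_append` and `set_statistics` (another query, or an ANALYZE started by someone else) observes a snapshot without stats. How a combined commit would actually be exposed in the API or spec is exactly the open question of this issue.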