findepi opened a new issue, #6442:
URL: https://github.com/apache/iceberg/issues/6442

   ### Feature Request / Improvement
   
   Currently `UpdateStatistics` 
(`org.apache.iceberg.Transaction#updateStatistics`) only allows adding statistics 
for an existing snapshot.
   As a result, it is currently not possible to publish a snapshot with statistics 
already collected.
   
   Collecting statistics for existing data is definitely an important 
use-case (like Trino's ANALYZE),
   but some query engines (like Trino) can collect stats on the fly when 
writing to a table (INSERT, CREATE TABLE AS ...).
   
   It's not difficult to:
   
   - publish a data change snapshot (adding new files)
   - take note of the new snapshot ID
   - add statistics for that snapshot
   
   However, this has some drawbacks:
   
   - new data is published without stats, so other queries can be planned 
sub-optimally, leading to e.g. improper use of cluster resources, or even 
unexpected query failures (if data changed significantly)
   - someone may run ANALYZE on the new snapshot (unknowingly or 
intentionally), and this will end up with two different threads wanting to add 
stats to it -- wasted work
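
The three-step workaround and the two drawbacks above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyTable`, `commitAppend`, `setStatistics` and `statistics` are hypothetical names invented for the sketch, not the Iceberg API, and stats files are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy in-memory model of the workaround. Hypothetical names; NOT the Iceberg API. */
public class StatsWindowSketch {

    /** Hypothetical stand-in for table metadata: a snapshot log plus per-snapshot stats. */
    static final class ToyTable {
        private final List<Long> snapshotIds = new ArrayList<>();
        private final Map<Long, String> statsBySnapshot = new HashMap<>();

        /** Step 1: publish a data change. Step 2: the caller notes the returned snapshot ID. */
        synchronized long commitAppend() {
            long id = snapshotIds.size() + 1L;
            snapshotIds.add(id);
            return id;
        }

        /** Step 3: attach stats to an existing snapshot; returns any stats that were replaced. */
        synchronized Optional<String> setStatistics(long snapshotId, String statsFile) {
            if (!snapshotIds.contains(snapshotId)) {
                throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
            }
            return Optional.ofNullable(statsBySnapshot.put(snapshotId, statsFile));
        }

        /** What a concurrent query planner would see for the given snapshot. */
        synchronized Optional<String> statistics(long snapshotId) {
            return Optional.ofNullable(statsBySnapshot.get(snapshotId));
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();

        // Steps 1 and 2: the data change becomes visible before any stats exist.
        long snapshotId = table.commitAppend();

        // Drawback 1: a query planned in this window finds no stats for the new snapshot.
        System.out.println("stats visible right after publish: "
                + table.statistics(snapshotId).isPresent());

        // Drawback 2: someone else runs ANALYZE on the new snapshot in the meantime...
        table.setStatistics(snapshotId, "stats-from-analyze");

        // ...and step 3 then replaces those stats, so one of the two computations was wasted.
        Optional<String> replaced = table.setStatistics(snapshotId, "stats-from-writer");
        System.out.println("writer replaced existing stats: " + replaced.isPresent());
    }
}
```

In this model the gap between `commitAppend` and `setStatistics` is exactly the window the issue describes; publishing the data change and its stats in one commit would remove it.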
   
   
   We should make it possible to publish a data change together with new stats.
   This may require API changes.
   It may also require spec changes, if we want to use an "inherit snapshot ID" 
model.
   (Maybe we don't have to, since stats are in metadata?)
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


