ajantha-bhat opened a new pull request, #6267: URL: https://github.com/apache/iceberg/pull/6267
Note that this is a DRAFT PR; I just want to use it for discussion. Once we agree on the changes, I can optimize them and add test cases.

Background:

1. We plan to reuse the statistics file between two snapshots when data files are rewritten, but there is no interface to get the statistics file for the current snapshot. The [spec](https://github.com/apache/iceberg/blob/master/format/spec.md#table-statistics) has a snapshot id in two places: one in `StatisticsFile` and another in its `blob metadata`. To support reuse of statistics files, the snapshot id in `StatisticsFile` should be the referenced snapshot id, not the computed-from snapshot id, so I updated the spec accordingly. Note that PR https://github.com/apache/iceberg/pull/6090 is stuck because of confusion around stats file reuse.
2. Added an interface to get the statistics file for the current snapshot. It can return null when the writer has not written stats for the latest snapshot.
3. Added an interface to get the checkpoint-snapshot-id for the table. It returns a snapshot id for which the stats file was successfully written, or -1 if none of the snapshots has a stats file written.

Later, when we introduce an async way of generating stats using ANALYZE TABLE or a CALL procedure, stats generation can use these APIs to decide whether to compute stats from the beginning or from the checkpoint.

cc: @rdblue, @findepi
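To make the two proposed lookups easier to discuss, here is a minimal, self-contained sketch of the intended behavior. It is not the code in this PR and not the Iceberg API: `StatsLookupSketch`, `StatsFile`, `TableState`, `currentStatisticsFile`, `checkpointSnapshotId`, `planStatsRun`, and the file name are all hypothetical. It also assumes that the checkpoint is the newest snapshot that has a stats file, which the description above does not state explicitly.

```java
import java.util.List;

// Hypothetical sketch only: none of these names are the actual Iceberg API.
public class StatsLookupSketch {

  /** Stand-in for a statistics file entry: the snapshot that references it, and its path. */
  record StatsFile(long snapshotId, String path) {}

  /** Stand-in for table metadata: snapshot ids ordered oldest to newest, plus known stats files. */
  record TableState(List<Long> snapshotIds, List<StatsFile> statsFiles) {

    /** Returns the stats file for the current (latest) snapshot, or null if none was written. */
    StatsFile currentStatisticsFile() {
      if (snapshotIds.isEmpty()) {
        return null;
      }
      return statsFileFor(snapshotIds.get(snapshotIds.size() - 1));
    }

    /**
     * Returns the id of the newest snapshot that has a stats file, or -1 if no snapshot has one.
     * Assumption: "checkpoint" means the newest such snapshot.
     */
    long checkpointSnapshotId() {
      for (int i = snapshotIds.size() - 1; i >= 0; i--) {
        long id = snapshotIds.get(i);
        if (statsFileFor(id) != null) {
          return id;
        }
      }
      return -1L;
    }

    private StatsFile statsFileFor(long snapshotId) {
      for (StatsFile file : statsFiles) {
        if (file.snapshotId() == snapshotId) {
          return file;
        }
      }
      return null;
    }
  }

  /** Shows how an async stats job might choose between a full and an incremental run. */
  static String planStatsRun(TableState table) {
    if (table.currentStatisticsFile() != null) {
      return "up-to-date";
    }
    long checkpoint = table.checkpointSnapshotId();
    return checkpoint == -1L ? "full" : "incremental-from-" + checkpoint;
  }

  public static void main(String[] args) {
    // Three snapshots; only the middle one has a stats file.
    TableState table = new TableState(
        List.of(10L, 20L, 30L),
        List.of(new StatsFile(20L, "stats-20.puffin")));

    System.out.println(table.currentStatisticsFile());
    System.out.println(table.checkpointSnapshotId());
    System.out.println(planStatsRun(table));
  }
}
```

In this sketch a null result from the first lookup means "no stats for the latest snapshot", and the second lookup then tells the stats job whether any earlier snapshot can serve as a starting point.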