ajantha-bhat opened a new pull request, #6267:
URL: https://github.com/apache/iceberg/pull/6267

   Note that this is a DRAFT PR; 
   I just want to use it for discussion. 
   Once we agree on the changes, I can optimize and add the test cases. 
   
   Background:
   1. We plan to reuse the statistics file between two snapshots when data 
files are rewritten. 
   But there is no interface to get the statistics file for the current 
snapshot. 
   
[Spec](https://github.com/apache/iceberg/blob/master/format/spec.md#table-statistics)
 has a snapshot id in two places, one in `StatisticsFile` and another in its 
`blob metadata`. 
   To support the reuse of statistics files, the snapshot id in 
`StatisticsFile` should be the referenced snapshot id, not the computed-from 
snapshot id. Hence, this PR updates the spec. 
   
   Note that PR https://github.com/apache/iceberg/pull/6090 is stuck because of 
confusion around stats file reuse. 
   
   2. Added an interface to get the statistics file for the current 
snapshot. It returns null when the writer has not written stats for the 
latest snapshot. 
   
   3. Added an interface to get the checkpoint-snapshot-id for the table. It 
returns the id of a snapshot for which the stats file was successfully 
written, or -1 if none of the snapshots has a stats file written. 
   
   Later, when we introduce an async way of generating stats using ANALYZE 
TABLE or a CALL procedure, stats generation can use inputs from these APIs to 
decide whether to compute stats from the beginning or from the checkpoint.
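   
   A minimal sketch of how stats generation might use the two lookups is 
below. Every name in it (`StatsLookup`, `StatsFile`, `Plan`, `plan`, `lookup`) 
is hypothetical and for discussion only; none of them is the interface added 
in this PR or an existing Iceberg API. The `UP_TO_DATE` case is also an 
assumption: it supposes nothing needs to be computed when the current snapshot 
already has a stats file. 

```java
public class StatsPlanningSketch {

    /** Hypothetical stand-in for a statistics file entry; not Iceberg's StatisticsFile. */
    static final class StatsFile {
        final long snapshotId;
        final String path;

        StatsFile(long snapshotId, String path) {
            this.snapshotId = snapshotId;
            this.path = path;
        }
    }

    /** Hypothetical view of the two lookups described in points 2 and 3. */
    interface StatsLookup {
        /** Stats file for the current snapshot, or null if none was written. */
        StatsFile currentStatisticsFile();

        /** Id of a snapshot that has a stats file, or -1 if no snapshot has one. */
        long checkpointSnapshotId();
    }

    enum Plan { UP_TO_DATE, FROM_CHECKPOINT, FROM_BEGINNING }

    /** Decides where stats generation should start, based on the two lookups. */
    static Plan plan(StatsLookup table) {
        if (table.currentStatisticsFile() != null) {
            return Plan.UP_TO_DATE;       // assumption: current stats need no recompute
        }
        if (table.checkpointSnapshotId() != -1L) {
            return Plan.FROM_CHECKPOINT;  // resume from the last snapshot with stats
        }
        return Plan.FROM_BEGINNING;       // no snapshot has stats yet
    }

    /** Small fixed-value implementation, used only to exercise plan(). */
    static StatsLookup lookup(final StatsFile current, final long checkpoint) {
        return new StatsLookup() {
            @Override
            public StatsFile currentStatisticsFile() {
                return current;
            }

            @Override
            public long checkpointSnapshotId() {
                return checkpoint;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(plan(lookup(null, -1L)));
        System.out.println(plan(lookup(null, 42L)));
        System.out.println(plan(lookup(new StatsFile(43L, "stats-43.puffin"), 43L)));
    }
}
```

   With these assumptions, a table with no stats at all is planned from the 
beginning, a table whose latest snapshot lacks stats but has a checkpoint is 
planned from that checkpoint, and a table with current stats is left alone. 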
   
   cc: @rdblue, @findepi  
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

