rdblue opened a new pull request, #12781:
URL: https://github.com/apache/iceberg/pull/12781

   This updates the spec to better handle upgrading tables to v3 and to 
simplify row ID assignment.
   
   Before, the spec stated that row IDs are assigned the first time a row is 
modified. Now row IDs are assigned for all added and existing rows in the first 
snapshot created after a table is upgraded to v3. This ensures that rows have 
IDs after the first commit to a branch.
   
   To assign row IDs to existing files, this updates the inheritance rules:
   1. Any live data file (ADDED or EXISTING) without `first_row_id` will be 
assigned one via inheritance
   2. The `first_row_id` assigned to a data file is the manifest's 
`first_row_id` plus sum(record_count) of the data files assigned before it in 
that manifest
   3. Any manifest without `first_row_id` in the manifest list must be assigned 
one at write time
   4. The `first_row_id` assigned to a manifest is the snapshot's 
`first-row-id` plus sum(existing_row_count + added_row_count) of the manifests 
assigned before it in the manifest list
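   
   The four rules above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the Iceberg implementation: `DataFile`, `Manifest`, `assign_manifest_ids`, and `inherit_file_ids` are hypothetical names, and the sketch assumes a manifest's row count covers all of its live (ADDED or EXISTING) files, including any that already carry a `first_row_id`.

```python
from dataclasses import dataclass
from typing import List, Optional

LIVE = ("ADDED", "EXISTING")  # DELETED entries are not live


@dataclass
class DataFile:
    # Hypothetical stand-in for a manifest entry plus its data file.
    record_count: int
    status: str = "ADDED"
    first_row_id: Optional[int] = None


@dataclass
class Manifest:
    # Hypothetical stand-in for a manifest list entry.
    files: List[DataFile]
    first_row_id: Optional[int] = None

    def row_count(self) -> int:
        # Assumed equal to existing_row_count + added_row_count.
        return sum(f.record_count for f in self.files if f.status in LIVE)


def assign_manifest_ids(manifests: List[Manifest], snapshot_first_row_id: int) -> int:
    """Rules 3 and 4: give every unassigned manifest a first_row_id at write time.

    Returns the next unused row ID after all assignments.
    """
    next_id = snapshot_first_row_id
    for m in manifests:
        if m.first_row_id is None:
            m.first_row_id = next_id
            # Reserve ID space for both existing and added rows.
            next_id += m.row_count()
    return next_id


def inherit_file_ids(manifest: Manifest) -> None:
    """Rules 1 and 2: live files without first_row_id inherit one."""
    if manifest.first_row_id is None:
        return  # nothing to inherit from
    next_id = manifest.first_row_id
    for f in manifest.files:
        if f.status in LIVE and f.first_row_id is None:
            f.first_row_id = next_id
            next_id += f.record_count
```

   Under these assumptions, a manifest that already has a `first_row_id` is skipped and consumes no new ID space, while an unassigned manifest reserves space for every live row it tracks.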
   
   I think that leaving ID space for both existing and added rows makes the 
feature simpler: any unassigned data file or data manifest will be assigned a 
`first_row_id`. That way we don't need to track whether existing files were 
inadvertently assigned a `first_row_id` when only added data files should 
have been.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org