rdblue opened a new pull request, #12781: URL: https://github.com/apache/iceberg/pull/12781
This updates the spec to better handle upgrading tables to v3 and to be simpler. Before, the spec stated that row IDs are assigned the first time a row is modified. Now row IDs are assigned for all added and existing rows in the first snapshot created after a table is upgraded to v3. This ensures that rows have IDs after the first commit to a branch.

In order to assign row IDs to existing files, this updates the inheritance rules:

1. Any live data file (ADDED or EXISTING) without `first_row_id` will be assigned one via inheritance
2. The `first_row_id` assigned to a data file is the manifest's `first_row_id` plus sum(record_count) over the other data files assigned before it
3. Any manifest without `first_row_id` in the manifest list must be assigned one at write time
4. The `first_row_id` assigned to a manifest is the snapshot's `first-row-id` plus sum(existing_row_count + added_row_count) over the other manifests assigned before it

I think that leaving ID space for both existing and added rows makes the feature simpler: any unassigned data file or data manifest will be assigned a `first_row_id`. That way we don't need to try to track whether existing files were inadvertently assigned a `first_row_id` when only added data files should be.
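The four inheritance rules described in this PR can be illustrated with a small Python sketch. It is not Iceberg code: the `DataFile` and `Manifest` classes and both functions are hypothetical, the field names follow the shorthand used in the description rather than the exact spec field names, and treating a DELETED entry as "not live" is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    status: str                         # "ADDED", "EXISTING", or "DELETED" (hypothetical model)
    record_count: int
    first_row_id: Optional[int] = None  # None means not yet assigned


@dataclass
class Manifest:
    existing_row_count: int             # names follow the PR description's shorthand
    added_row_count: int
    files: List[DataFile] = field(default_factory=list)
    first_row_id: Optional[int] = None  # None means not yet assigned


def assign_manifest_ids(manifests: List[Manifest], snapshot_first_row_id: int) -> int:
    """Rules 3 and 4: give every unassigned manifest a first_row_id.

    Returns the next unused row ID after all assignments.
    """
    next_id = snapshot_first_row_id
    for m in manifests:
        if m.first_row_id is None:
            m.first_row_id = next_id
            # Reserve ID space for both existing and added rows.
            next_id += m.existing_row_count + m.added_row_count
    return next_id


def assign_file_ids(manifest: Manifest) -> None:
    """Rules 1 and 2: give every live, unassigned data file a first_row_id."""
    next_id = manifest.first_row_id
    for f in manifest.files:
        # Assumption: DELETED entries are not live and are skipped.
        if f.status != "DELETED" and f.first_row_id is None:
            f.first_row_id = next_id
            next_id += f.record_count


# A manifest that already has IDs, one carried over from before the upgrade,
# and one holding newly added rows.
old = Manifest(0, 10, [DataFile("ADDED", 10, first_row_id=0)], first_row_id=0)
upgraded = Manifest(30, 0, [DataFile("EXISTING", 10), DataFile("EXISTING", 20)])
new = Manifest(0, 5, [DataFile("ADDED", 5)])
manifests = [old, upgraded, new]

next_row_id = assign_manifest_ids(manifests, snapshot_first_row_id=1000)
for m in manifests:
    assign_file_ids(m)

print([m.first_row_id for m in manifests])       # [0, 1000, 1030]
print([f.first_row_id for f in upgraded.files])  # [1000, 1010]
print(next_row_id)                               # 1035
```

In this sketch the already-assigned manifest is left alone, while the unassigned manifests get consecutive ID ranges sized by their existing plus added rows. That sizing is what lets existing files receive IDs without separate tracking.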