RussellSpitzer commented on code in PR #12781: URL: https://github.com/apache/iceberg/pull/12781#discussion_r2040478827
########## format/spec.md: ########## @@ -450,21 +448,24 @@ Within `added1`, the first added manifest, each data file's `first_row_id` follo The `first_row_id` of the EXISTING file `data1` was already assigned, so the file metadata was copied into manifest `added1`. -Files `data2` and `data3` are written with `null` for `first_row_id` and are assigned `first_row_id` at read time based on the manifest's `first_row_id` and the `record_count` of previously listed ADDED files in this manifest: (1,000 + 0) and (1,000 + 50). +Files `data2` and `data3` are written with `null` for `first_row_id` and are assigned `first_row_id` at read time based on the manifest's `first_row_id` and the `record_count` of previously files without `first_row_id` in this manifest: (1,000 + 0) and (1,000 + 50). The snapshot then populates the total number of `added-rows` based on the sum of all added rows in the manifests: 100 (50 + 50) -When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225: +When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 375 rows were in data files in manifests that were assigned a `first_row_id` (`added1` 100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375. ##### Row Lineage for Upgraded Tables -Any snapshot without the field `first-row-id` does not have any lineage information and values for `_row_id` and `_last_updated_sequence_number` cannot be assigned accurately. +When a table is upgraded to v3, existing snapshots are not modified and do not have `first-row-id` set. For such snapshots without `first-row-id`, `first_row_id` values for data files and data manifests are null, and values for `_row_id` are read as null for all rows. When `first_row_id` is null, inherited row ID values are also null. + +Snapshots that are created after upgrading to v3 must set the snapshot's `first-row-id` and assign row IDs to existing and added files in the snapshot. When writing the manifest list, all data manifests must be assigned a `first_row_id`, which assigns a `first_row_id` to all data files via inheritance. + +Note that: -All files that were added before upgrading to v3 must propagate null for all row-lineage related -fields. The values for `_row_id` and `_last_updated_sequence_number` must always return null and when these rows are copied, -null must be explicitly written. After this point, rows are treated as if they were just created -and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows. +* Snapshots from before upgrading to v3 do not have row IDs. Review Comment: So does this mean that we aren't going to do any work on upgrade to assign new row Id's to existing snapshots? We'll only do this on the creation of future snapshots created after the upgrade? I feel like we can handle this now by just setting next-row-id on all snapshots using the logic below. Then all previous snapshots can get row_id -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org