Re: [PR] Spec: Update row lineage requirements for upgrading tables [iceberg]

via GitHub Thu, 17 Apr 2025 10:53:02 -0700


szehon-ho commented on code in PR #12781:
URL: https://github.com/apache/iceberg/pull/12781#discussion_r2049405529



##########
format/spec.md:
##########
@@ -732,12 +738,10 @@ Valid snapshots are stored as a list in table metadata. 
For serialization, see A
 
 #### Snapshot Row IDs
 
-A snapshot's `first-row-id` is assigned to the table's current `next-row-id` 
on each commit attempt. If a commit is retried, the `first-row-id` must be 
reassigned. If a commit contains no new rows, `first-row-id` should be omitted.
+A snapshot's `first-row-id` is assigned to the table's current `next-row-id` 
on each commit attempt. If a commit is retried, the `first-row-id` must be 
reassigned based on the table's current `next-row-id`. The `first-row-id` field 
is required even if a commit does not assign any ID space.

Review Comment:
   Note: this seems a good clarification, if i understand this correctly, the 
next-row-id could be the same even on the next attempt, before it seemed like 
it implied it needed to be different.



##########
format/spec.md:
##########
@@ -450,21 +452,24 @@ Within `added1`, the first added manifest, each data 
file's `first_row_id` follo
 
 The `first_row_id` of the EXISTING file `data1` was already assigned, so the 
file metadata was copied into manifest `added1`.
 
-Files `data2` and `data3` are written with `null` for `first_row_id` and are 
assigned `first_row_id` at read time based on the manifest's `first_row_id` and 
the `record_count` of previously listed ADDED files in this manifest: (1,000 + 
0) and (1,000 + 50).
+Files `data2` and `data3` are written with `null` for `first_row_id` and are 
assigned `first_row_id` at read time based on the manifest's `first_row_id` and 
the `record_count` of previous files without `first_row_id` in this manifest: 
(1,000 + 0) and (1,000 + 50).
 
 The snapshot then populates the total number of `added-rows` based on the sum 
of all added rows in the manifests: 100 (50 + 50)
 
-When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 225 rows 
were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 
1,000 + 225 = 1,225:
+When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 375 rows 
were in data files in manifests that were assigned a `first_row_id` (`added1` 
100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375.
 
 
 ##### Row Lineage for Upgraded Tables 
 
-Any snapshot without the field `first-row-id` does not have any lineage 
information and values for `_row_id` and `_last_updated_sequence_number` cannot 
be assigned accurately.  
+When a table is upgraded to v3, its `next-row-id` is initailized to 0 and 
existing snapshots are not modified (that is, `first-row-id` remains unset or 
null). For such snapshots without `first-row-id`, `first_row_id` values for 
data files and data manifests are null, and values for `_row_id` are read as 
null for all rows. When `first_row_id` is null, inherited row ID values are 
also null.
+
+Snapshots that are created after upgrading to v3 must set the snapshot's 
`first-row-id` and assign row IDs to existing and added files in the snapshot. 
When writing the manifest list, all data manifests must be assigned a 
`first_row_id`, which assigns a `first_row_id` to all data files via 
inheritance.
+
+Note that:
 
-All files that were added before upgrading to v3 must propagate null for all 
row-lineage related
-fields. The values for `_row_id` and `_last_updated_sequence_number` must 
always return null and when these rows are copied, 
-null must be explicitly written. After this point, rows are treated as if they 
were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new 
rows.
+* Snapshots created before upgrading to v3 do not have row IDs.
+* After upgrading, new snapshots in different branches will assign disjoint ID 
ranges to existing data files, based on the table's `next-row-id` when the 
snapshot is committed. For a data file in multiple branches, a writer may write 
the `first_row_id` from another branch or may assign a new `first_row_id` to 
the data file (to avoid large metadata rewrites).

Review Comment:
   if i understand, this means if same data file exists on different branch, 
can have same first_row_id?  It seems the two sentence contradict (first 
sentence specifies disjoint Id ranges).  Would it be more clear:
   
   > After upgrading, new snapshots in different branches will assign disjoint 
ID ranges to existing **distinct** data files, based on the table's 
`next-row-id` when the snapshot is committed. For the **same** data file in 
multiple branches...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spec: Update row lineage requirements for upgrading tables [iceberg]

Reply via email to