Re: [PR] Spec: Update row lineage requirements for upgrading tables [iceberg]

via GitHub Mon, 14 Apr 2025 09:04:12 -0700


rdblue commented on code in PR #12781:
URL: https://github.com/apache/iceberg/pull/12781#discussion_r2042460867



##########
format/spec.md:
##########
@@ -450,21 +448,24 @@ Within `added1`, the first added manifest, each data 
file's `first_row_id` follo
 
 The `first_row_id` of the EXISTING file `data1` was already assigned, so the 
file metadata was copied into manifest `added1`.
 
-Files `data2` and `data3` are written with `null` for `first_row_id` and are 
assigned `first_row_id` at read time based on the manifest's `first_row_id` and 
the `record_count` of previously listed ADDED files in this manifest: (1,000 + 
0) and (1,000 + 50).
+Files `data2` and `data3` are written with `null` for `first_row_id` and are 
assigned `first_row_id` at read time based on the manifest's `first_row_id` and 
the `record_count` of previously files without `first_row_id` in this manifest: 
(1,000 + 0) and (1,000 + 50).
 
 The snapshot then populates the total number of `added-rows` based on the sum 
of all added rows in the manifests: 100 (50 + 50)
 
-When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 225 rows 
were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 
1,000 + 225 = 1,225:
+When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 375 rows 
were in data files in manifests that were assigned a `first_row_id` (`added1` 
100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375.
 
 
 ##### Row Lineage for Upgraded Tables 
 
-Any snapshot without the field `first-row-id` does not have any lineage 
information and values for `_row_id` and `_last_updated_sequence_number` cannot 
be assigned accurately.  
+When a table is upgraded to v3, existing snapshots are not modified and do not 
have `first-row-id` set. For such snapshots without `first-row-id`, 
`first_row_id` values for data files and data manifests are null, and values 
for `_row_id` are read as null for all rows. When `first_row_id` is null, 
inherited row ID values are also null.
+
+Snapshots that are created after upgrading to v3 must set the snapshot's 
`first-row-id` and assign row IDs to existing and added files in the snapshot. 
When writing the manifest list, all data manifests must be assigned a 
`first_row_id`, which assigns a `first_row_id` to all data files via 
inheritance.
+
+Note that:
 
-All files that were added before upgrading to v3 must propagate null for all 
row-lineage related
-fields. The values for `_row_id` and `_last_updated_sequence_number` must 
always return null and when these rows are copied, 
-null must be explicitly written. After this point, rows are treated as if they 
were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new 
rows.
+* Snapshots from before upgrading to v3 do not have row IDs.

Review Comment:
   Yes, I don't think that we should add row lineage for older snapshots 
because it creates a lot of problems that we don't have good ways to solve and 
doesn't create much, if any, value.
   
   Here are some of the problems:
   * To add row IDs to older snapshots, we would need to change the snapshot 
metadata. This could be cached so changing it may cause strange issues, but the 
larger problem is that modifying it would require coordination (to avoid losing 
the changes) and changes to the REST protocol to replace an older snapshot
   * If we were to add row IDs to older snapshots, then those row IDs would not 
be very useful without an expensive operation that rewrites the whole metadata 
tree.
       * If we assigned new row IDs independently in each snapshot, then a data 
file would have different row IDs across versions -- making the IDs useless.
       * That means we would need to track data files and update the metadata 
tree with consistent `first_row_id` values.
       * Even with the expensive rewrite of the whole metadata tree, the 
`_row_id` column would be missing from data files so the lineage of individual 
rows is not available.
   
   I think the best approach is to not modify older snapshots. The goal with 
this update is to ensure that we have good row lineage from the next snapshot 
after upgrade and forward, when we can write `_row_id`. That's why I updated 
the `first_row_id` assignment to include any unassigned data file or data 
manifest, and changed the assignment strategy to leave space for existing rows. 
With those changes we have created an invariant: new snapshots in v3 tables 
always have IDs assigned to all rows.
   
   Branching is still a little difficult because we don't want to analyze 
branch history (which may be lost) and attempt to assign consistently. With 
this update, branches get separate IDs but it leave open the possibility to do 
some external analysis and assign the same IDs for the same data files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spec: Update row lineage requirements for upgrading tables [iceberg]

Reply via email to