Re: [PR] Spec: Update v3 summary, add row lineage [iceberg]

via GitHub Tue, 06 May 2025 15:10:46 -0700


rdblue commented on code in PR #12982:
URL: https://github.com/apache/iceberg/pull/12982#discussion_r2076417558



##########
format/spec.md:
##########
@@ -1680,6 +1680,25 @@ Row-level delete changes:
     * These position delete files must be merged into the DV for a data file 
when one is created
     * Position delete files that contain deletes for more than one data file 
need to be kept in table metadata until all deletes are replaced by DVs
 
+Row lineage changes:
+
+* Writers must set the table's `next-row-id` and use the existing 
`next-row-id` as the `first-row-id` to create a new snapshot
+    * When a table is upgraded to v3, `next_row_id` should be initialized to 0
+    * When committing a new snapshot `next-row-id` must be incremented by at 
least the number of newly assigned row ids in the snapshot
+    * It is recommended to increment `next-row-id` by the total 
`added_rows_count` and `existing_rows_count` of all manifests assigned a 
`first_row_id`
+* Writers must assign a `first_row_id` to new data manifests when writing a 
manifest list
+    * When writing a new manifest list each `first_row_id` must be incremented 
by at least the number of newly assigned row ids in each manifest
+    * It is recommended to increment `first_row_id` by a manifest's 
`added_rows_count` and `existing_rows_count`
+* Readers must assign a `first_row_id` for any data file that does not have an 
assigned value written in a manifest
+    * Readers must increment `first_row_id` by the data file's `record_count`
+* When writing an existing data file into a new manifest, its `first_row_id` 
must be written into the manifest
+* When a data file has a non-null `first_row_id`, readers must:
+    * Replace any null or missing `_row_id` with the data file's 
`first_row_id` plus the row's `_pos`
+    * Replace any null or missing `_last_updated_sequence_number` to the data 
file's `data_sequence_number`
+    * Read any non-null `_row_id` or `_last_updated_sequence_number` without 
modification
+* When a data file has a null `first_row_id`, readers must produce null for 
`_row_id` and `_last_updated_sequence_number`

Review Comment:
   I took another pass and updated for this concern. I added:
   * New data files are written with a null `first_row_id`
   * Inheritance happens when the manifest has a non-null `first_row_id`
   
   I think the requirements under this point specifically are okay. When the 
data file has a `first_row_id`, the value of `_row_id` must never be read as 
null or missing. The missing case is for when the data file is entirely new and 
has no row IDs. And a data file may have only some rows with null `_row_id` 
when they are written by a MERGE statement; we inject a null for inserted rows 
and preserve `_row_id` for the updated rows. Similar logic applies to 
`_last_updated_sequence_number` and the final sub-point is that rows with 
non-null `_row_id` or `_last_updated_sequence_number` should not have those 
values replaced.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec: Update v3 summary, add row lineage [iceberg]

Reply via email to