amogh-jahagirdar commented on code in PR #12580:
URL: https://github.com/apache/iceberg/pull/12580#discussion_r2017717508


##########
format/spec.md:
##########
@@ -458,11 +457,11 @@ The snapshot then populates the total number of 
`added-rows` based on the sum of
 When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 225 rows 
were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 
1,000 + 225 = 1,225:
 
 
-##### Enabling Row Lineage for Non-empty Tables
+##### Row Lineage for Non-empty, Upgraded Tables 
 
 Any snapshot without the field `first-row-id` does not have any lineage 
information and values for `_row_id` and `_last_updated_sequence_number` cannot 
be assigned accurately.  
 
-All files that were added before `row-lineage` was enabled should propagate 
null for all of the `row-lineage` related
+All files that were added before upgrading to v3 should propagate null for all 
of the `row-lineage` related

Review Comment:
   I think the `should` here in `should propogate null` has to be a `must` if 
I'm not mistaken? To not leave any room for implementations to propogate any 
non-null values for files that didn't have lineage.



##########
format/spec.md:
##########
@@ -408,16 +406,17 @@ When `null`, a row's `_row_id` field is assigned to the 
`first_row_id` from its
 
 Values for `_row_id` and `_last_updated_sequence_number` are either read from 
the data file or assigned at read time. As a result on read, rows in a table 
always have non-null values for these fields when lineage is enabled.
 
-When an existing row is moved to a different data file for any reason, writers 
are required to write `_row_id` and `_last_updated_sequence_number` according 
to the following rules:
+When an existing row is moved to a different data file for any reason, writers 
should write `_row_id` and `_last_updated_sequence_number` according to the 
following rules:
 
 1. The row's existing non-null `_row_id` must be copied into the new data file
 2. If the write has modified the row, the `_last_updated_sequence_number` 
field must be set to `null` (so that the modification's sequence number 
replaces the current value)
 3. If the write has not modified the row, the existing non-null 
`_last_updated_sequence_number` value must be copied to the new data file
 
+Engines may model operations as deleting rows and inserting rows or as 
modifications to rows that preserve row ids.

Review Comment:
   I think we'd still need to call out somewhere that 
"_last_updated_sequence_number" should be preserved if a row is unmodified in 
the case an operation is not modeled as a delete + insert, and an exisitng row 
is moved to a different file. 
   
   The section above on line 391 used to say that, but I think now it's 
removed. 



##########
format/spec.md:
##########
@@ -458,11 +457,11 @@ The snapshot then populates the total number of 
`added-rows` based on the sum of
 When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 225 rows 
were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 
1,000 + 225 = 1,225:
 
 
-##### Enabling Row Lineage for Non-empty Tables
+##### Row Lineage for Non-empty, Upgraded Tables 

Review Comment:
   Minor: I feel like we could just leave it as `Row Lineage for Upgraded 
Tables`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to