amogh-jahagirdar commented on code in PR #12580: URL: https://github.com/apache/iceberg/pull/12580#discussion_r2017717508
########## format/spec.md: ########## @@ -458,11 +457,11 @@ The snapshot then populates the total number of `added-rows` based on the sum of When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225: -##### Enabling Row Lineage for Non-empty Tables +##### Row Lineage for Non-empty, Upgraded Tables Any snapshot without the field `first-row-id` does not have any lineage information and values for `_row_id` and `_last_updated_sequence_number` cannot be assigned accurately. -All files that were added before `row-lineage` was enabled should propagate null for all of the `row-lineage` related +All files that were added before upgrading to v3 should propagate null for all of the `row-lineage` related Review Comment: I think the `should` here in `should propogate null` has to be a `must` if I'm not mistaken? To not leave any room for implementations to propogate any non-null values for files that didn't have lineage. ########## format/spec.md: ########## @@ -408,16 +406,17 @@ When `null`, a row's `_row_id` field is assigned to the `first_row_id` from its Values for `_row_id` and `_last_updated_sequence_number` are either read from the data file or assigned at read time. As a result on read, rows in a table always have non-null values for these fields when lineage is enabled. -When an existing row is moved to a different data file for any reason, writers are required to write `_row_id` and `_last_updated_sequence_number` according to the following rules: +When an existing row is moved to a different data file for any reason, writers should write `_row_id` and `_last_updated_sequence_number` according to the following rules: 1. The row's existing non-null `_row_id` must be copied into the new data file 2. If the write has modified the row, the `_last_updated_sequence_number` field must be set to `null` (so that the modification's sequence number replaces the current value) 3. If the write has not modified the row, the existing non-null `_last_updated_sequence_number` value must be copied to the new data file +Engines may model operations as deleting rows and inserting rows or as modifications to rows that preserve row ids. Review Comment: I think we'd still need to call out somewhere that "_last_updated_sequence_number" should be preserved if a row is unmodified in the case an operation is not modeled as a delete + insert, and an exisitng row is moved to a different file. The section above on line 391 used to say that, but I think now it's removed. ########## format/spec.md: ########## @@ -458,11 +457,11 @@ The snapshot then populates the total number of `added-rows` based on the sum of When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225: -##### Enabling Row Lineage for Non-empty Tables +##### Row Lineage for Non-empty, Upgraded Tables Review Comment: Minor: I feel like we could just leave it as `Row Lineage for Upgraded Tables`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org