rdblue commented on code in PR #11130: URL: https://github.com/apache/iceberg/pull/11130#discussion_r1777761549
##########
format/spec.md:
##########
@@ -298,16 +298,143 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU
 The set of metadata columns is:
-| Field id, name | Type | Description |
-|-----------------------------|---------------|-------------|
-| **`2147483646 _file`** | `string` | Path of the file in which a row is stored |
-| **`2147483645 _pos`** | `long` | Ordinal position of a row in the source data file |
-| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted |
-| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row |
-| **`2147483642 _partition`** | `struct` | Partition to which a row belongs |
-| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files |
-| **`2147483545 pos`** | `long` | Ordinal position of a row, used in position-based delete files |
-| **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files |
+| Field id, name | Type | Description |
+|----------------------------------|---------------|---------------------------------------------------------------------------------------------------------|
+| **`2147483646 _file`** | `string` | Path of the file in which a row is stored |
+| **`2147483645 _pos`** | `long` | Ordinal position of a row in the source data file, starting at `0` |
+| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted |
+| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row |
+| **`2147483642 _partition`** | `struct` | Partition to which a row belongs |
+| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files |
+| **`2147483545 pos`** | `long` | Ordinal position of a row, used in position-based delete files |
+| **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files |
+| **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
+| **`2147483542 _last_update`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |
+
+### Row Lineage
+
+In Specification V3, an Iceberg Table can declare that engines must track row-lineage of all newly created rows. This
+requirement is controlled by setting the field `row-lineage` to true in the table's metadata. When true, two additional
+fields in data files will be available for all rows added to the table.
+
+* `_row_id` a unique long for every row. Computed via inheritance for rows in their original datafiles
+and explicitly written when the row is moved to a new file.
+* `_last_update` the sequence number of the commit which last updated this row. The value is computed via inheritance for
+rows in their original file or in files where the row was modified.
+
+Commits with `row-lineage` enabled are not allowed to include any [Equality Deletes](#equality-delete-files).
+
+Implementations writing to tables where `row-lineage` is enabled must populate several additional
+fields in the metadata and propagate row information from existing and updated.
+
+#### Metadata Propagation
+
+Creating a new commit when `row-lineage` is enabled requires
+* Setting the new snapshot's `first-row-id` field to the previous table metadata's `last-row-id` field
+* Setting `first_row_id` in new data manifest entries in the new manifest list to `first-row-id` field
+from the snapshot plus the sum of all `added_rows_count` in previously listed new data manifest entries.
+* Setting `first_row_id` in new `data_file` entries to null and in existing `data_file` structs to their original
+or computed value (if previously null) of `first_row_id`
+* Incrementing the new metadata's `last-row-id` field by the number of new rows added to the table by the commit.
+
+When reading a `data_file` manifest entry with a null `first_row_id`, the value is calculated as the
+sum of the `total_records` of every previous added `data_file` entry summed with the`first_row_id` of the manifest.

Review Comment:
I think this conflicts with the requirement above, which states that `last-row-id` is incremented by the number of new rows. I'm assuming that by `total_records` you mean the data file's `record_count` (which is what I suggested using above). If that's the case, then the `first_row_id` of a data file could exceed the `last-row-id` of the commit.

For example, if a commit rewrites two data files, A and B (each 100 rows), and adds just one new row to each of the resulting A2 and B2 files (each 101 rows), then `last-row-id` would be `previous-last-row-id + 2`. But the `first_row_id` of file B2 would be `previous-last-row-id + 101` (A2's `record_count`).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
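The arithmetic in the review comment's example can be checked with a small Python sketch. It is only an illustration of the rules as quoted in the diff, not Iceberg code: the `DataFile` class, the `inherited_first_row_ids` helper, and the starting value `1000` for the previous `last-row-id` are hypothetical, and it assumes a single new manifest whose `first_row_id` equals the previous `last-row-id`.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for a manifest `data_file` entry."""
    name: str
    record_count: int  # total rows stored in the file
    new_rows: int      # rows that did not exist before this commit


def inherited_first_row_ids(manifest_first_row_id, files):
    """Assign first_row_id to each added file as proposed in the diff:
    the manifest's first_row_id plus the record counts of all previously
    listed added files."""
    ids = {}
    offset = manifest_first_row_id
    for f in files:
        ids[f.name] = offset
        offset += f.record_count
    return ids


previous_last_row_id = 1000  # arbitrary value chosen for illustration

# A and B (100 rows each) are rewritten into A2 and B2, each gaining one new row.
files = [DataFile("A2", 101, 1), DataFile("B2", 101, 1)]

# Rule from the diff: last-row-id grows by the number of *new* rows only.
new_last_row_id = previous_last_row_id + sum(f.new_rows for f in files)

first_ids = inherited_first_row_ids(previous_last_row_id, files)

print(new_last_row_id)  # previous-last-row-id + 2
print(first_ids["B2"])  # previous-last-row-id + 101
```

Under these assumptions the inherited `first_row_id` of B2 lands beyond the commit's new `last-row-id`, which is the conflict the comment describes.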