Re: [PR] Spec: Adds Row Lineage [iceberg]

via GitHub Thu, 26 Sep 2024 15:08:16 -0700


rdblue commented on code in PR #11130:
URL: https://github.com/apache/iceberg/pull/11130#discussion_r1777781725



##########
format/spec.md:
##########
@@ -298,16 +298,137 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU
 
 The set of metadata columns is:
 
-| Field id, name              | Type          | Description |
-|-----------------------------|---------------|-------------|
-| **`2147483646  _file`**     | `string`      | Path of the file in which a row is stored |
-| **`2147483645  _pos`**      | `long`        | Ordinal position of a row in the source data file |
-| **`2147483644  _deleted`**  | `boolean`     | Whether the row has been deleted |
-| **`2147483643  _spec_id`**  | `int`         | Spec ID used to track the file containing a row |
-| **`2147483642  _partition`** | `struct`     | Partition to which a row belongs |
-| **`2147483546  file_path`** | `string`      | Path of a file, used in position-based delete files |
-| **`2147483545  pos`**       | `long`        | Ordinal position of a row, used in position-based delete files |
-| **`2147483544  row`**       | `struct<...>` | Deleted row values, used in position-based delete files |
+| Field id, name                    | Type          | Description                                                                   |
+|-----------------------------------|---------------|-------------------------------------------------------------------------------|
+| **`2147483646  _file`**           | `string`      | Path of the file in which a row is stored                                     |
+| **`2147483645  _pos`**            | `long`        | Ordinal position of a row in the source data file, starting at `0`            |
+| **`2147483644  _deleted`**        | `boolean`     | Whether the row has been deleted                                              |
+| **`2147483643  _spec_id`**        | `int`         | Spec ID used to track the file containing a row                               |
+| **`2147483642  _partition`**      | `struct`      | Partition to which a row belongs                                              |
+| **`2147483546  file_path`**       | `string`      | Path of a file, used in position-based delete files                           |
+| **`2147483545  pos`**             | `long`        | Ordinal position of a row, used in position-based delete files                |
+| **`2147483544  row`**             | `struct<...>` | Deleted row values, used in position-based delete files                       |
+| **`2147483545  _row_identifier`** | `long`        | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
+| **`2147483545  _last_update`**    | `long`        | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage)  |
+
+### Row Lineage
+
+In Specification V3, an Iceberg Table can declare that engines must track 
row-lineage of all newly created rows. This
+requirement is controlled by setting the field `row-lineage` to true in the 
table's metadata. When true, two additional 
+fields in data files will be available for all rows added to the table.
+
+* `_row_identifier` a unique long for every row. Computed via inheritance for 
rows in their original datafiles 
+and explicitly written when the row is moved to a new file. The values are 
monotonically increasing but can be sparse.
+* `_last_update` the sequence number of the commit which last updated this 
row. The value is computed via inheritance for 
+rows in their original file or in files where the row was modified. The value 
is explicitly populated when the row is moved 
+to a new file without modification.
+
+Commits with `row-lineage` enabled are not allowed to include any [Equality 
Deletes](#equality-delete-files).
+
+Implementations writing to tables where `row-lineage` is enabled must populate 
several additional 
+fields in the metadata and propagate row information from existing and 
updated. 
+
+#### Metadata Propagation
+
+Creating a new commit when `row-lineage` is enabled requires
+* Setting the new snapshot's `first-used-identifier` field to the previous 
table metadata's `last-used-identifier` field
+* Setting `first_used_identifier` in new data manifest entries in the new 
manifest list to `first-used-identifier` field 
+from the snapshot plus the sum of all `added_rows_count` in previously listed 
new data manifest entries.
+* Setting `first_used_identifier` in new `data_file` entries  to null and in 
existing `data_file` structs to their original
+or computed value (if previously null) of `first_used_identifier`
+* Incrementing the new metadata's `last-used-identifier` field by the number 
of new rows added to the table by the commit.
+
+When reading a `data_file` manifest entry with a null `first_used_identifier`, 
the value is calculated as the 
+sum of the `total_records` of every previous added `data_file` entry summed 
with the`first_used_identifier` of the manifest. 
+
+
+An example of producing a new commit with `row-lineage` true, 
+
+Given a original metadata
+
+```json
+{
+  "row-lineage": true,
+  "last-used-identifier": 1000
+}
+```
+
+A new snapshot would be created 
+```json
+{
+  "first-used-identifier": 1000
+}
+```
+
+Which would point to a manifest-list which calculates `first_used_identifier` 
for each manifest based on total number of 
+added rows in all previous manifests as ordered within the manifest-list
+
+| `manifest_path` | `added_rows_count` | `existing_rows_count` | `first_used_identifier` |
+|---------------|--------------------|-----------------------|-------------------------|
+| path_1        | 250                | 75                    | 1000                    |
+| path_2        | 225                | 25                    | 1250                    |
+| path_3        | 0                  | 100                   | 1475                    |
+| path_4        | 125                | 25                    | 1475                    |
+
+In the above example, path_3 can be ignored from the calculation of the next 
manifest's `first_used_identifier` because
+no rows were added to the table. The value of `first_used_identifier` is 
computed as the sum of all  `added_rows_count` 
+values from manifests already listed in the manifest list added to the 
snapshot's `first-used-identifier`.
+
+The new `data_file` entries within these manifests have their  
`first-used-identifier` field set to null and on read 
+should have the value computed using inheritance from their manifest's 
`first_used_identifier` value. Any existing
+`data_file` which is written into the manifest should have its value for 
`first_used_identifier` copied directly.
+
+The manifest path_1 could have the following entries (imputed values shown in 
parentheses)
+
+| `file_path` | `record_count` | `first_used_identifier` |
+|-------------|----------------|-------------------------|
+| data_1      | 100            | null (1000)         |
+| data_2      | 75             | 800                     |
+| data_3      | 150            | null (1100)             |
+
+Two newly added files, data_1 and data_3, have `first_used_identifier` set to 
null. This allows writers to add new
+data files to inherit the values for the total number of added rows from other 
data_files as well as the snapshot's
+`first_used_identifier`. When reading the `first_used_identifier` column, the 
value of `first_used_identifier` from the 
+manifest  entry is combined with the sum of `record_count` for all previous 
`data_file` structs that were added in the 
+manifest.
+
+The existing file, data_2, came from a previous table state (not included in 
this example) and has it's inherited value
+for `first_used_identifier` written out explicitly into the `data_file` struct 
since it can no longer be computed in 
+the context of the current manifest-list.
+
+Finally, the new table metadata must update `last-used-identifier` to the new 
highest possible value. In this example, 
+the last largest value for `first_used_identifier` in the manifest-list plus 
the `added_rows_count` for that manifest 
+(path_4), 1475 + 125  = 1600.
+
+```json
+{
+  "last-used-identifier": 1600
+}
+```
+
+#### Datafile Propagation
+
+New data files added when `row-lineage` is enabled do not require any 
modification. The columns for `_row_identifiier`

Review Comment:
   What about existing rows that are added to these files? I don't think this 
quite works unless all of the rows are new and get new IDs.
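
For reference, the inheritance arithmetic described in the quoted diff can be sketched as below. This is only an illustration of the worked example (1000 -> 1600), not Iceberg code: the function names and the tuple layout are hypothetical, and the sketch assumes every row in a newly added data file is a new row, which is exactly the assumption this comment questions.

```python
from typing import List, Optional, Tuple


def assign_manifest_first_ids(
    snapshot_first_id: int, added_rows_counts: List[int]
) -> Tuple[List[int], int]:
    """Hypothetical helper: derive first_used_identifier for each manifest.

    Each manifest starts at the snapshot's first-used-identifier plus the
    sum of added_rows_count over the manifests listed before it.
    Returns the per-manifest values and the resulting last-used-identifier.
    """
    first_ids = []
    next_id = snapshot_first_id
    for added in added_rows_counts:
        first_ids.append(next_id)
        next_id += added  # manifests that add no rows do not advance the counter
    return first_ids, next_id


def resolve_data_file_first_ids(
    manifest_first_id: int, entries: List[Tuple[int, Optional[int]]]
) -> List[int]:
    """Hypothetical helper: resolve first_used_identifier for data_file entries.

    entries holds (record_count, first_used_identifier) pairs in manifest
    order. None marks a newly added file whose value is inherited; an
    explicit value marks an existing file and is copied through unchanged.
    """
    resolved = []
    next_id = manifest_first_id
    for record_count, explicit in entries:
        if explicit is None:
            resolved.append(next_id)
            next_id += record_count  # only added files consume identifiers
        else:
            resolved.append(explicit)
    return resolved


# Values taken from the example tables in the diff.
manifest_ids, last_used = assign_manifest_first_ids(1000, [250, 225, 0, 125])
# manifest_ids == [1000, 1250, 1475, 1475], last_used == 1600

file_ids = resolve_data_file_first_ids(1000, [(100, None), (75, 800), (150, None)])
# file_ids == [1000, 800, 1100]
```

Under that assumption the numbers match the tables in the diff. A new file that also carries existing rows has no place in this calculation, since its `record_count` would consume identifiers for rows that already have one.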



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

