Re: [PR] Spec: Adds Row Lineage [iceberg]

via GitHub Fri, 27 Sep 2024 13:23:14 -0700


RussellSpitzer commented on code in PR #11130:
URL: https://github.com/apache/iceberg/pull/11130#discussion_r1779115055



##########
format/spec.md:
##########
@@ -298,16 +298,143 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU
 
 The set of metadata columns is:
 
-| Field id, name              | Type          | Description |
-|-----------------------------|---------------|-------------|
-| **`2147483646  _file`**     | `string`      | Path of the file in which a row is stored |
-| **`2147483645  _pos`**      | `long`        | Ordinal position of a row in the source data file |
-| **`2147483644  _deleted`**  | `boolean`     | Whether the row has been deleted |
-| **`2147483643  _spec_id`**  | `int`         | Spec ID used to track the file containing a row |
-| **`2147483642  _partition`** | `struct`     | Partition to which a row belongs |
-| **`2147483546  file_path`** | `string`      | Path of a file, used in position-based delete files |
-| **`2147483545  pos`**       | `long`        | Ordinal position of a row, used in position-based delete files |
-| **`2147483544  row`**       | `struct<...>` | Deleted row values, used in position-based delete files |
+| Field id, name                   | Type          | Description |
+|----------------------------------|---------------|---------------------------------------------------------------------------------------------------------|
+| **`2147483646  _file`**          | `string`      | Path of the file in which a row is stored |
+| **`2147483645  _pos`**           | `long`        | Ordinal position of a row in the source data file, starting at `0` |
+| **`2147483644  _deleted`**       | `boolean`     | Whether the row has been deleted |
+| **`2147483643  _spec_id`**       | `int`         | Spec ID used to track the file containing a row |
+| **`2147483642  _partition`**     | `struct`      | Partition to which a row belongs |
+| **`2147483546  file_path`**      | `string`      | Path of a file, used in position-based delete files |
+| **`2147483545  pos`**            | `long`        | Ordinal position of a row, used in position-based delete files |
+| **`2147483544  row`**            | `struct<...>` | Deleted row values, used in position-based delete files |
+| **`2147483543  _row_id`**        | `long`        | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
+| **`2147483542  _last_update`**   | `long`        | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |
+
+### Row Lineage

Review Comment:
   Every `_row_id` can be used to track the creation of a row by checking the 
`_row_id` high water mark of each snapshot in the table history. This allows a 
user (with sufficient snapshot history) to determine when any particular row was 
initially added to the table. The second field, `last-updated-seq`, points to 
the update in which the row was last modified. 
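
To make the high water mark lookup concrete, here is a minimal sketch. It assumes each snapshot records the first row id that has not yet been assigned once that snapshot is committed; `SnapshotInfo`, `row_id_high_water_mark`, and `creating_snapshot` are hypothetical names for illustration, not part of the spec or of any Iceberg API.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SnapshotInfo:
    """Toy stand-in for a snapshot entry (hypothetical, not an Iceberg class)."""
    snapshot_id: int
    sequence_number: int
    # Assumption: first row id NOT yet assigned after this snapshot commits.
    row_id_high_water_mark: int


def creating_snapshot(history: Sequence[SnapshotInfo], row_id: int) -> Optional[SnapshotInfo]:
    """Return the oldest snapshot in `history` whose high water mark exceeds `row_id`.

    `history` must be ordered oldest to newest, so the marks are non-decreasing.
    Returns None if no retained snapshot has assigned `row_id` yet.
    """
    marks = [s.row_id_high_water_mark for s in history]
    idx = bisect_right(marks, row_id)  # index of the first mark strictly greater than row_id
    return history[idx] if idx < len(history) else None


history = [
    SnapshotInfo(snapshot_id=11, sequence_number=1, row_id_high_water_mark=100),
    SnapshotInfo(snapshot_id=12, sequence_number=2, row_id_high_water_mark=100),  # no new rows
    SnapshotInfo(snapshot_id=13, sequence_number=3, row_id_high_water_mark=250),
]

first_for_42 = creating_snapshot(history, 42)
first_for_100 = creating_snapshot(history, 100)
first_for_250 = creating_snapshot(history, 250)
```

If older snapshots have been expired, the answer for an early row id is only an upper bound: the row was added in or before the oldest retained snapshot.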
   
   Together these allow you to determine when a row was created and when it was 
last changed. The origin of a modified row is always the row with the exact same 
`_row_id` in the commit before `last-updated-seq`.
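
The origin lookup described here can be sketched against a toy in-memory model. The sketch assumes table state is available per commit as a mapping from sequence number to `{_row_id: row}`; that model, the `last_updated_seq` key, and `previous_version` are all hypothetical and only illustrate the rule.

```python
from typing import Dict, Optional

Row = Dict[str, object]
TableStates = Dict[int, Dict[int, Row]]  # sequence number -> {_row_id: row}


def previous_version(states: TableStates, row_id: int) -> Optional[Row]:
    """Return the row with the same _row_id in the commit before its last update.

    Returns None if the row does not exist, or if it was never modified
    after it was created (no earlier commit contains that _row_id).
    """
    current = states[max(states)].get(row_id)
    if current is None:
        return None
    last_updated = current["last_updated_seq"]
    earlier = [seq for seq in states if seq < last_updated]
    if not earlier:
        return None
    return states[max(earlier)].get(row_id)


states: TableStates = {
    1: {0: {"value": "a", "last_updated_seq": 1}},
    2: {0: {"value": "a", "last_updated_seq": 1},
        1: {"value": "x", "last_updated_seq": 2}},
    3: {0: {"value": "b", "last_updated_seq": 3},  # row 0 rewritten in commit 3
        1: {"value": "x", "last_updated_seq": 2}},
}

origin_of_row_0 = previous_version(states, 0)
origin_of_row_1 = previous_version(states, 1)
```

Row 0 resolves to its value as of commit 2, while row 1 has no origin because commit 2 is where it was first written.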
   
   Read impact should be zero, since scans do not need to materialize these 
columns. Merge and copy statements should get slightly more expensive because 
more data has to pass through the compute engine, although the size of that 
overhead will vary by engine.
   
   The impact on file size should be relatively low, but we can run some 
benchmarks once the reference implementation is done. For use cases without 
row-level updates it should be very cheap, since any materialized `_row_id` 
and `last-updated-seq` values should be either very similar (and therefore 
compressible) or identical. 
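
As a rough intuition for the compressibility argument (not a benchmark, and not how a real file format would encode these columns), the sketch below packs three hypothetical columns of 64-bit values and compresses each with `zlib` as a stand-in codec. The column contents and the row count are made up for illustration.

```python
import random
import struct
import zlib

NUM_ROWS = 10_000  # arbitrary size for the toy comparison


def compressed_size(values):
    """Pack values as little-endian signed 64-bit integers and zlib-compress them."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw))


# Append-only file: every row shares one sequence number, row ids are consecutive.
identical_seq = [7] * NUM_ROWS
consecutive_ids = list(range(1_000_000, 1_000_000 + NUM_ROWS))

# Baseline with no structure at all, for contrast.
rng = random.Random(0)
unstructured = [rng.getrandbits(63) for _ in range(NUM_ROWS)]

size_identical = compressed_size(identical_seq)
size_consecutive = compressed_size(consecutive_ids)
size_unstructured = compressed_size(unstructured)
```

Identical values collapse to almost nothing and consecutive ids still compress well below the unstructured baseline; columnar encodings such as run-length or delta encoding would be expected to exploit the same structure, though actual numbers need the benchmarks mentioned above.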



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


