Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]

via GitHub Mon, 28 Oct 2024 16:30:20 -0700


rdblue commented on code in PR #11240:
URL: https://github.com/apache/iceberg/pull/11240#discussion_r1819892263



##########
format/spec.md:
##########
@@ -585,13 +589,19 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`132  split_offsets`**          | 
`list<133: long>`                                                           | 
Split offsets for the data file. For example, all row group offsets in a 
Parquet file. Must be sorted ascending                                          
                                                          |
 |            | _optional_ | _optional_ | **`135  equality_ids`**           | 
`list<136: int>`                                                            | 
Field ids used to determine row equality in equality delete files. Required 
when `content=2` and should be null otherwise. Fields with ids listed in this 
column must be present in the delete file                |
 | _optional_ | _optional_ | _optional_ | **`140  sort_order_id`**          | 
`int`                                                                       | 
ID representing sort order for this file [3].                                   
                                                                                
                                                   |
-|            |            | _optional_ | **`142  first_row_id`**           | 
`long`                                                                      | 
The `_row_id` for the first row in the data file. See [First Row ID 
Inheritance](#first-row-id-inheritance)                                         
                                                                        |
+|            |            | _optional_ | **`142  first_row_id`**           | 
`long`                                                                      | 
The `_row_id` for the first row in the data file. See [First Row ID 
Inheritance](#first-row-id-inheritance)                                         
                                                               |
+|            | _optional_ | _optional_ | **`143  referenced_data_file`**   | 
`string`                                                                    | 
Fully qualified location (URI with FS scheme) of a data file that all deletes 
reference [4]                                                                   
                                                     |
+|            |            | _optional_ | **`144  content_offset`**         | 
`long`                                                                      | 
The offset in the file where the content starts [5]                             
                                                                                
                                                   |
+|            |            | _optional_ | **`145  content_size_in_bytes`**  | 
`long`                                                                      | 
The length of a referenced content stored in the file; required if 
`content_offset` is present [5]                                                 
                                                                |
+
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in 
Appendix D.
 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the 
IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper 
bounds.
 3. If sort order ID is missing or unknown, then the order is assumed to be 
unsorted. Only data files and equality delete files should be written with a 
non-null order id. [Position deletes](#position-delete-files) are required to 
be sorted by file and position, not a table order, and should set sort order id 
to null. Readers must ignore sort order id for position delete files.
-4. The following field ids are reserved on `data_file`: 141.
+4. Position delete metadata can use `referenced_data_file` when all deletes 
tracked by the entry are in a single data file. Setting the referenced file is 
required for deletion vectors.
+5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. The values 
must exactly match the `offset` and `length` stored in the Puffin footer for 
the deletion vector blob.

Review Comment:
   Updated to work around the fact that these aren't actually required in the 
table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]

Reply via email to