Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796405024 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at Review Comment: Extra space before `A set bit ...`?
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796407251 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a Review Comment: Extra space here too.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796409299 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian Review Comment: It is the length of the bitmap serialized using the portable format + 4 extra bytes for the magic number.
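To make the key/sub-position split and membership test described in the quoted hunk concrete, here is a minimal Java sketch. It is illustrative only (not the PR's implementation) and assumes the 32-bit bitmaps are held in a map keyed by the upper 4 bytes, using the `RoaringBitmap` class from the org.roaringbitmap Java library.

```java
import java.util.Map;
import java.util.TreeMap;

import org.roaringbitmap.RoaringBitmap;

// Illustrative delete vector: 64-bit positions split into a 32-bit key
// (most significant 4 bytes) and a 32-bit sub-position (least significant
// 4 bytes), with one 32-bit Roaring bitmap per key, ordered by unsigned key.
class DeleteVectorSketch {
  private final Map<Integer, RoaringBitmap> bitmaps = new TreeMap<>(Integer::compareUnsigned);

  void setDeleted(long pos) {
    if (pos < 0) {
      throw new IllegalArgumentException("Position must have its most significant bit unset: " + pos);
    }
    int key = (int) (pos >>> 32); // most significant 4 bytes
    int subPos = (int) pos;       // least significant 4 bytes
    bitmaps.computeIfAbsent(key, k -> new RoaringBitmap()).add(subPos);
  }

  boolean isDeleted(long pos) {
    RoaringBitmap bitmap = bitmaps.get((int) (pos >>> 32));
    // No bitmap for the key means the position is not set.
    return bitmap != null && bitmap.contains((int) pos);
  }
}
```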
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796412573 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | _
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796415097 ## format/spec.md: ## @@ -841,19 +855,45 @@ Notes: ## Delete Formats -This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1. +This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are added by v2 and are not supported in v1. Deletion vectors are added in v3 and are not supported in v2 or earlier. Position delete files must not be added to v3 tables, but existing position delete files are valid. + +There are three types of row-level deletes: +* Deletion vectors (DVs) identify deleted rows within a single referenced data file by position in a bitmap +* Position delete files identify deleted rows by file location and row position (**deprecated**) +* Equality delete files identify deleted rows by the value of one or more columns + +Deletion vectors are a binary representation of deletes for a single data file that is more efficient at execution time than position delete files. Unlike equality or position delete files, there can be at most one deletion vector for a given data file in a table. Writers must ensure that there is at most one deletion vector per data file and must merge new deletes with existing vectors or position delete files. + +Row-level delete files (both equality and position delete files) are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that these delete files are written using the table's default file format. + +Row-level delete files and deletion vectors are tracked by manifests. A separate set of manifests is used for delete files and DVs, but the same manifest schema is used for both data and delete manifests. Deletion vectors are tracked individually by file location, offset, and length within the containing file. Deletion vector metadata must include the referenced data file. + +Both position and equality delete files allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. + -Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format. +### Deletion Vectors -Row-level delete files are tracked by manifests, like data files. A separate set of manifests is used for delete files, but the manifest schemas are identical. +Deletion vectors identify deleted rows of a file by encoding deleted positions in a bitmap. A set bit at position P indicates that the row at position P is deleted. -Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. +These vectors are stored using the `delete-vector-v1` blob definition from the [Puffin spec][puffin-spec]. +Deletion vectors support positive 64-bit positions, but are optimized for cases where most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a 32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using the least significant 4 bytes. For each key in the set of positions, a 32-bit Roaring bitmap is maintained to store a set of 32-bit sub-positions for that key. 
Review Comment: nit: I think this paragraph and the next one are somewhat redundant with the puffin spec (or, if they do provide additional information, I think consolidating in puffin makes sense, since the operations would likely change for a v2 bitmap).
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796415699 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will jump directly to an offset without reading the Puffin footer; checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty. It is still a question for the Iceberg community whether dropping the length field and using a consistent byte order is a significant enough improvement to justify breaking compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that. @emkornfield, what are your thoughts?
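As a rough illustration of the read path being discussed (jumping straight to the blob offset, checking the magic, and verifying the CRC), here is a hedged Java sketch. It assumes the length field is the portable-serialized bitmap length plus 4 bytes for the magic (per the earlier comment) and that the CRC-32 covers the serialized bitmap, as in the quoted layout; it is not Iceberg's actual reader.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative read-path validation for a delete-vector-v1 blob fetched by
// offset/length: check the 4-byte big-endian length, the magic sequence, and
// the big-endian CRC-32 of the serialized bitmap, then return the bitmap bytes.
class DeleteVectorBlobReader {
  private static final byte[] MAGIC = {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  static byte[] bitmapBytes(byte[] blob) {
    ByteBuffer buf = ByteBuffer.wrap(blob).order(ByteOrder.BIG_ENDIAN);

    int length = buf.getInt(); // assumed: serialized bitmap length + 4 bytes of magic
    byte[] magic = new byte[4];
    buf.get(magic);
    if (!Arrays.equals(magic, MAGIC)) {
      throw new IllegalStateException("Not a delete-vector-v1 blob: bad magic");
    }

    byte[] bitmap = new byte[length - 4];
    buf.get(bitmap);

    long expectedCrc = Integer.toUnsignedLong(buf.getInt());
    CRC32 crc = new CRC32();
    crc.update(bitmap);
    if (crc.getValue() != expectedCrc) {
      throw new IllegalStateException("CRC mismatch: possible corruption in storage");
    }

    return bitmap; // portable-format Roaring bitmap, little-endian internally
  }
}
```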
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796416800 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | _
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796416775 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian Review Comment: Correct, this choice was driven by compatibility with Delta.
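For the write side, here is a minimal sketch of the same framing under the same assumptions (length field = portable bitmap length + 4 bytes of magic, CRC-32 over the serialized bitmap, big-endian length and CRC); again illustrative rather than the PR's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

// Illustrative write-path framing: 4-byte big-endian length (bitmap + 4 bytes
// of magic), the magic sequence, the portable-format bitmap bytes, and a
// 4-byte big-endian CRC-32 computed over the serialized bitmap.
class DeleteVectorBlobWriter {
  private static final byte[] MAGIC = {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  static byte[] frame(byte[] portableBitmap) {
    CRC32 crc = new CRC32();
    crc.update(portableBitmap);

    ByteBuffer out = ByteBuffer.allocate(4 + MAGIC.length + portableBitmap.length + 4)
        .order(ByteOrder.BIG_ENDIAN);
    out.putInt(portableBitmap.length + MAGIC.length); // length field includes the magic
    out.put(MAGIC);
    out.put(portableBitmap);
    out.putInt((int) crc.getValue());
    return out.array();
  }
}
```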
Re: [PR] open-api: Build runtime jar for test fixture [iceberg]
ajantha-bhat commented on code in PR #11279: URL: https://github.com/apache/iceberg/pull/11279#discussion_r1796417822 ## build.gradle: ## @@ -1006,6 +1009,37 @@ project(':iceberg-open-api') { recommend.set(true) } check.dependsOn('validateRESTCatalogSpec') + + // Create a custom configuration that extends testFixturesImplementation and is resolvable + configurations { +testFixturesShadowImplementation { + canBeResolved = true + canBeConsumed = false + extendsFrom testFixturesImplementation +} + } + + shadowJar { +archiveBaseName.set("iceberg-open-api-test-fixtures-runtime") +archiveClassifier.set(null) +configurations = [project.configurations.testFixturesShadowImplementation] +from sourceSets.testFixtures.output +zip64 true + +// include the LICENSE and NOTICE files for the runtime Jar +from(projectDir) { + include 'LICENSE' Review Comment: Do we have to do this manually, or can we use a standard tool for this? I don't think we have standard guidelines for this in "CONTRIBUTING.md". Can we add them?
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796418244 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | Review Comment: I don't ha
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796418361 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] Review Comment: These 32-bit bitmaps have no meaning on their own. I'd argue we need the ability to skip entire 64-bit bitmaps, which is possible with the proposed layout. This is the official "portable" Roaring format, which is the only standard for 64-bit implementations.
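A hedged Java sketch of parsing the portable layout listed in the quoted hunk (8-byte little-endian bitmap count, then a 4-byte little-endian key followed by a standard 32-bit Roaring bitmap per entry). It assumes `RoaringBitmap.deserialize(DataInput)` from the org.roaringbitmap library can read the embedded 32-bit bitmaps; names here are illustrative, not Iceberg's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.roaringbitmap.RoaringBitmap;

// Illustrative parse of the Roaring "portable" layout: an 8-byte little-endian
// bitmap count, then for each entry a 4-byte little-endian key followed by a
// standard 32-bit Roaring bitmap. DataInputStream reads big-endian, so the
// little-endian count and keys are byte-reversed.
class PortableDeleteVectorParser {
  static Map<Integer, RoaringBitmap> parse(byte[] portableBitmap) throws IOException {
    Map<Integer, RoaringBitmap> bitmaps = new TreeMap<>(Integer::compareUnsigned);
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(portableBitmap))) {
      long bitmapCount = Long.reverseBytes(in.readLong());
      for (long i = 0; i < bitmapCount; i++) {
        int key = Integer.reverseBytes(in.readInt());
        RoaringBitmap bitmap = new RoaringBitmap();
        bitmap.deserialize(in); // embedded 32-bit Roaring bitmap
        bitmaps.put(key, bitmap);
      }
    }
    return bitmaps;
  }
}
```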
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796419039 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | _
Re: [PR] API, Core: Add scan planning api models and parsers [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1796419309 ## .palantir/revapi.yml: ## @@ -1058,6 +1058,11 @@ acceptedBreaks: new: "method void org.apache.iceberg.encryption.PlaintextEncryptionManager::()" justification: "Deprecations for 1.6.0 release" "1.6.0": +org.apache.iceberg:iceberg-api: +- code: "java.class.defaultSerializationChanged" + old: "class org.apache.iceberg.expressions.ResidualEvaluator" + new: "class org.apache.iceberg.expressions.ResidualEvaluator" + justification: "local testing" Review Comment: If we do what I mentioned below, we should be able to get rid of this.
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796419515 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | Review Comment: I will not
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796420278 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies Review Comment: The problem is that it will be in the footer, so it will require an extra request. Do you have good use cases in mind, @emkornfield?
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796419760 ## format/puffin-spec.md: ## @@ -123,6 +123,44 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector that represents the positions of rows in a file that +are deleted. A set bit at position P indicates that the row at position P is +deleted. + +The bitmap supports positive 64-bit positions, but is optimized for cases where +most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. +64-bit positions are divided into a 32-bit "key" using the most significant 4 +bytes and a 32-bit position using the least significant 4 bytes. For each key +in the set of positions, a 32-bit Roaring bitmap is maintained to store a set +of 32-bit positions for that key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes are +tested for inclusion in the bitmap. If a bitmap is not found for the key, then +it is not set. + +The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377 +stored as 4 bytes, little-endian). Following the magic bytes is the serialized +collection of bitmaps. The collection is stored using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies to +* `cardinality`: number of deleted rows (set positions) in the delete vector Review Comment: The plan was to set the sequence number and snapshot ID to -1, as they will be assigned during commits. We should update this PR to include this.
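A small sketch of the blob metadata being discussed, assuming the properties are carried as plain string key/value pairs in the Puffin footer; the property names come from the quoted spec text and the -1 placeholders follow the comment above. This is not Iceberg's Puffin writer API.

```java
import java.util.Map;

// Illustrative blob metadata for a delete-vector-v1 blob. Property names come
// from the quoted spec text; the -1 placeholders for snapshot ID and sequence
// number are stand-ins until the commit assigns real values, per the comment above.
class DeleteVectorBlobMetadataSketch {
  static final long UNASSIGNED_SNAPSHOT_ID = -1L;     // assumed placeholder
  static final long UNASSIGNED_SEQUENCE_NUMBER = -1L; // assumed placeholder

  static Map<String, String> blobProperties(String referencedDataFile, long cardinality) {
    return Map.of(
        "referenced-data-file", referencedDataFile,
        "cardinality", Long.toString(cardinality));
  }
}
```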
Re: [PR] API, Core: Add scan planning api models and parsers [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1796421177 ## core/src/main/java/org/apache/iceberg/rest/RESTFileScanTaskParser.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.util.List; +import org.apache.iceberg.BaseFileScanTask; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.ExpressionParser; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.expressions.ResidualEvaluator; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; + +public class RESTFileScanTaskParser { + private static final String DATA_FILE = "data-file"; + private static final String DELETE_FILE_REFERENCES = "delete-file-references"; + private static final String RESIDUAL = "residual-filter"; + + private RESTFileScanTaskParser() {} + + public static void toJson( + FileScanTask fileScanTask, List deleteFiles, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(fileScanTask != null, "Invalid file scan task: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); +generator.writeFieldName(DATA_FILE); +RESTContentFileParser.toJson(fileScanTask.file(), generator); + +// TODO revisit this logic +if (deleteFiles != null) { + generator.writeArrayFieldStart(DELETE_FILE_REFERENCES); + for (int delIndex = 0; delIndex < deleteFiles.size(); delIndex++) { +generator.writeNumber(delIndex); + } + generator.writeEndArray(); +} +if (fileScanTask.residual() != null) { + generator.writeFieldName(RESIDUAL); + ExpressionParser.toJson(fileScanTask.residual(), generator); +} +generator.writeEndObject(); + } + + public static FileScanTask fromJson(JsonNode jsonNode, List allDeleteFiles) { +Preconditions.checkArgument(jsonNode != null, "Invalid JSON node for file scan task: null"); +Preconditions.checkArgument( +jsonNode.isObject(), "Invalid JSON node for file scan task: non-object (%s)", jsonNode); + +DataFile dataFile = (DataFile) RESTContentFileParser.fromJson(jsonNode.get(DATA_FILE)); + +DeleteFile[] matchedDeleteFiles = null; +List deleteFileReferences = null; +if (jsonNode.has(DELETE_FILE_REFERENCES)) { + ImmutableList.Builder deleteFileReferencesBuilder = ImmutableList.builder(); + JsonNode deletesArray = jsonNode.get(DELETE_FILE_REFERENCES); + for 
(JsonNode deleteRef : deletesArray) { +deleteFileReferencesBuilder.add(deleteRef); + } + deleteFileReferences = deleteFileReferencesBuilder.build(); +} + +if (deleteFileReferences != null) { + ImmutableList.Builder matchedDeleteFilesBuilder = ImmutableList.builder(); + for (Integer deleteFileIdx : deleteFileReferences) { +matchedDeleteFilesBuilder.add(allDeleteFiles.get(deleteFileIdx)); + } + matchedDeleteFiles = (DeleteFile[]) matchedDeleteFilesBuilder.build().stream().toArray(); +} + +// TODO revisit this in spec +Expression filter = Expressions.alwaysTrue(); +if (jsonNode.has(RESIDUAL)) { + filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL)); +} + +ResidualEvaluator residualEvaluator = ResidualEvaluator.of(filter); + +// TODO at the time of creation we dont have the schemaString and specString so can we avoid +// setting this +// will need to refresh before returning closed iterable of tasks, for now put place holder null +BaseFileScanTask baseFileScanTask = +new BaseFileScanTask(dataFile, matchedDeleteFiles, null, null, residualEvaluator); Review Comment: we can introduce a `UnboundFileScanTask` which extends BaseFileSca
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796421105 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies Review Comment: I was assuming footer parsing first. I think if the intent to jump directly to a location from manifests, then it would serve no purpose (I added some questions in this regard on the table spec changes). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796415699 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will directly jump to an offset without reading the Puffin footer. Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we had used little endian, as for the bitmap and magic number. It is still a question for the Iceberg community whether dropping the length and using a consistent byte order is significant enough to justify breaking compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that. @emkornfield, what are your thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796427242 ## format/spec.md: ## @@ -619,19 +627,25 @@ Data files that match the query filter must be read by the scan. Note that for any snapshot, all file paths marked with "ADDED" or "EXISTING" may appear at most once across all manifest files in the snapshot. If a file path appears more than once, the results of the scan are undefined. Reader implementations may raise an error in this case, but are not required to do so. -Delete files that match the query filter must be applied to data files at read time, limited by the scope of the delete file using the following rules. +Delete files and deletion vector metadata that match the filters must be applied to data files at read time, limited by the following scope rules. +* A deletion vector must be applied to a data file when all of the following are true: +- The data file's `file_path` is equal to the deletion vector's `referenced_data_file` +- The data file's data sequence number is _less than or equal to_ the deletion vector's data sequence number +- The data file's partition (both spec and partition values) is equal [4] to the deletion vector's partition Review Comment: Using the correct spec and partition for deletion vectors is important so that we can skip them using partition predicates, just like we skip position delete files today. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
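As a reading aid (and not code from either PR), the three conditions quoted above can be collapsed into a single predicate. The types and accessors below are hypothetical, and the partition is modeled as a plain value for brevity:

```java
// Hypothetical, simplified model of the scope rule quoted above; not Iceberg's API.
record DataFileRef(String path, long dataSequenceNumber, int specId, String partition) {}

record DeletionVectorRef(String referencedDataFile, long dataSequenceNumber, int specId, String partition) {}

class DeletionVectorScopeSketch {
  static boolean applies(DataFileRef dataFile, DeletionVectorRef dv) {
    return dataFile.path().equals(dv.referencedDataFile())
        // the data file's data sequence number is <= the DV's data sequence number
        && dataFile.dataSequenceNumber() <= dv.dataSequenceNumber()
        // partition equality covers both the spec and the partition values
        && dataFile.specId() == dv.specId()
        && dataFile.partition().equals(dv.partition());
  }
}
```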
Re: [I] chore(deps): Bump crate-ci/typos from 1.24.6 to 1.25.0 [iceberg-rust]
liurenjie1024 closed issue #659: chore(deps): Bump crate-ci/typos from 1.24.6 to 1.25.0 URL: https://github.com/apache/iceberg-rust/issues/659 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796429496 ## format/spec.md: ## @@ -841,19 +855,45 @@ Notes: ## Delete Formats -This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1. +This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are added by v2 and are not supported in v1. Deletion vectors are added in v3 and are not supported in v2 or earlier. Position delete files must not be added to v3 tables, but existing position delete files are valid. Review Comment: Shall we add some notes that the first ever operation after migrating table to V3 must produce a DV that includes all previous position deletes for that file? And that implementations can safely discard all file-scoped delete files but have to keep partition-scoped delete files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates [iceberg]
amogh-jahagirdar commented on code in PR #11261: URL: https://github.com/apache/iceberg/pull/11261#discussion_r1796430454 ## api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java: ## @@ -60,7 +57,6 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound) { } public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSensitive) { -this.schema = schema; Review Comment: Ah nvm, I see that it was only used for the purpose of resolving if a field was nested or not ## api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java: ## @@ -60,7 +57,6 @@ public StrictMetricsEvaluator(Schema schema, Expression unbound) { } public StrictMetricsEvaluator(Schema schema, Expression unbound, boolean caseSensitive) { -this.schema = schema; Review Comment: Why are we removing this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796427704 ## format/spec.md: ## @@ -619,19 +627,25 @@ Data files that match the query filter must be read by the scan. Note that for any snapshot, all file paths marked with "ADDED" or "EXISTING" may appear at most once across all manifest files in the snapshot. If a file path appears more than once, the results of the scan are undefined. Reader implementations may raise an error in this case, but are not required to do so. -Delete files that match the query filter must be applied to data files at read time, limited by the scope of the delete file using the following rules. +Delete files and deletion vector metadata that match the filters must be applied to data files at read time, limited by the following scope rules. +* A deletion vector must be applied to a data file when all of the following are true: +- The data file's `file_path` is equal to the deletion vector's `referenced_data_file` +- The data file's data sequence number is _less than or equal to_ the deletion vector's data sequence number +- The data file's partition (both spec and partition values) is equal [4] to the deletion vector's partition * A _position_ delete file must be applied to a data file when all of the following are true: +- The data file's `file_path` is equal to the delete file's `referenced_data_file` if it is non-null - The data file's data sequence number is _less than or equal to_ the delete file's data sequence number - The data file's partition (both spec and partition values) is equal [4] to the delete file's partition +- There is no deletion vector that must be applied to the data file (when added, such a vector must contain all deletes from existing position delete files) Review Comment: Correct, this is how the design doc proposes it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796430758 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] Review Comment: I might be misunderstanding the layout, and it is enough of an edge case for this use case that it doesn't matter, but suppose I have 3 bitmaps (e.g. prefix keys 0x00, 0x01, 0x02, which implies at least an 8 billion row data file?): it appears one couldn't skip to the 0x01 bitmap without first reading the 0x00 one (i.e. I don't quite get how this could be random access without loading the entire data structure). If the layout was instead `` one could randomly seek to the bitmap of interest without parsing preceding ones. Again, given the edge case of having > 4B row data files, I think this point is mostly theoretical. I guess the alternative, again worth discussing, is putting an upper bound of `UINT32_MAX` rows on data files, which would allow one to just use the 32-bit roaring bitmap. Maybe famous last words, but a 4 billion row Iceberg data file should be enough for anyone :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796431209 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian Review Comment: This seems fairly redundant with offset and length information that we are already storing, even for compatibility purposes I'm not sure why it is needed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796436581 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: > Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer. I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is probably unlikely to get valid ones if the offset is incorrect. The CRC32 provides further sanity checks if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry. > It is still a question for the Iceberg community whether dropping the length and using a consistent byte order is significant enough to justify breaking compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that. My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, on the surface it seems that the Delta Lake spec could be revised in a manner that converges with the Iceberg spec without replicating a technical decision. The likelihood that these differences actually have any impact, as long as the spec is clear enough, is probably minimal. The CRC is probably where it has the most impact (and I don't have any stats, but I assume this would be small). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796441921 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | Review Comment: @aokolnych
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
stevenzwu commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796422065 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at Review Comment: nit: a file -> a data file ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: is the choice of little endian to match most processors' architecture to avoid byte swapping on read? Agree that it would be nice to use consistent byte ordering (little endian). Parquet and Orc seem to use little endian to encode integers too. I don't fully understand the compatibility part with Delta lake. It is not like Delta lake would use Puffin file directly, right? I assume some conversion is still needed. those meta fields are cheap to convert byte order if needed, as long as we keep the big bitmap vector compatible with the same portable serialization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
rdblue commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796135851 ## format/spec.md: ## @@ -619,19 +627,25 @@ Data files that match the query filter must be read by the scan. Note that for any snapshot, all file paths marked with "ADDED" or "EXISTING" may appear at most once across all manifest files in the snapshot. If a file path appears more than once, the results of the scan are undefined. Reader implementations may raise an error in this case, but are not required to do so. -Delete files that match the query filter must be applied to data files at read time, limited by the scope of the delete file using the following rules. +Delete files and deletion vector metadata that match the filters must be applied to data files at read time, limited by the following scope rules. +* A deletion vector must be applied to a data file when all of the following are true: +- The data file's `file_path` is equal to the deletion vector's `referenced_data_file` +- The data file's data sequence number is _less than or equal to_ the deletion vector's data sequence number +- The data file's partition (both spec and partition values) is equal [4] to the deletion vector's partition Review Comment: Yeah, I debated whether to keep this or not and ended up deciding that it is helpful. The second requirement should also be impossible if the first requirement is true. I kept both of these because I want people reading the spec to understand that these can be relied on in the scan planning algorithm. If this only said that the `file_path` must match, then implementers may think that they need to consider _all_ deletion vectors without first filtering by partition and data sequence number. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
rdblue commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796139073 ## format/spec.md: ## @@ -604,7 +612,7 @@ Scans are planned by reading the manifest files for the current snapshot. Delete Manifests that contain no matching files, determined using either file counts or partition summaries, may be skipped. -For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter data and delete files. These partition predicates are used to select the data and delete files in the manifest. This conversion uses the partition spec used to write the manifest file. +For each manifest, scan predicates, which filter data rows, are converted to partition predicates, which filter partition tuples. These partition predicates are used to select data files, delete files, and deletion vector metadata relevant to the scan. This conversion uses the partition spec used to write the manifest file. Review Comment: Thanks! I've incorporated this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: Adds Row Lineage [iceberg]
stevenzwu commented on code in PR #11130: URL: https://github.com/apache/iceberg/pull/11130#discussion_r1796168929 ## format/spec.md: ## @@ -298,16 +298,101 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU The set of metadata columns is: -| Field id, name | Type | Description | -|-|---|-| -| **`2147483646 _file`** | `string` | Path of the file in which a row is stored | -| **`2147483645 _pos`** | `long`| Ordinal position of a row in the source data file | -| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted | -| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row | -| **`2147483642 _partition`** | `struct` | Partition to which a row belongs | -| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files | -| **`2147483545 pos`** | `long`| Ordinal position of a row, used in position-based delete files | -| **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files | +| Field id, name | Type | Description | +|--|---|-| +| **`2147483646 _file`** | `string` | Path of the file in which a row is stored | +| **`2147483645 _pos`** | `long`| Ordinal position of a row in the source data file, starting at `0` | +| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted| +| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row | +| **`2147483642 _partition`** | `struct` | Partition to which a row belongs| +| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files | +| **`2147483545 pos`**| `long`| Ordinal position of a row, used in position-based delete files | +| **`2147483544 row`**| `struct<...>` | Deleted row values, used in position-based delete files | +| **`2147483540 _row_id`**| `long`| A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) | +| **`2147483539 _last_updated_seq`** | `long`| The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) | + +### Row Lineage + +In v3 and later, an Iceberg table can track row lineage fields for all newly created rows. Row lineage is enabled by setting the field `row-lineage` to true in the table's metadata. When enabled, engines must maintain the `next-row-id` table field and the following row-level fields when writing data files: + +* `_row_id` a unique long identifier for every row within the table. The value is assigned via inheritance when a row is first added to the table and the existing value is explicitly written when the row is written to a new file. +* `_last_updated_seq` the sequence number of the commit that last updated a row. The value is inherited when a row is first added or modified and the existing value is explicitly written when the row is written to a different data file but not modified. + +These fields are assigned and updated by inheritance because the commit sequence number and starting row ID are not assigned until the snapshot is successfully committed. Inheritance is used to allow writing data and manifest files before values are known so that it is not necessary to rewrite data and manifest files when an optimistic commit is retried. + +When row lineage is enabled, new snapshots cannot include [Equality Deletes](#equality-delete-files). Row lineage is incompatible with equality deletes because lineage values must be maintained, but equality deletes are used to avoid reading existing data before writing changes. 
+ + + Row lineage assignment + +Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_updated_seq`) must not be written to data files. The rest of this section applies when row lineage is enabled. + +When a row is added or modified, the `_last_updated_seq` field is set to `null` so that it
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796222680 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: Review Comment: ```suggestion The blob's `properties` object must include the following metadata: ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796223977 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: Review Comment: it took me a few seconds to understand where this is stored in the blob metadata. This seems clearer to me but is mostly a nit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]
nastra commented on code in PR #10722: URL: https://github.com/apache/iceberg/pull/10722#discussion_r1796490776 ## open-api/rest-catalog-open-api.yaml: ## @@ -3142,6 +3163,10 @@ components: type: object additionalProperties: type: string +storage-credentials: + type: array + items: +$ref: '#/components/schemas/Credential' Review Comment: renamed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Derive PartialEq for FileScanTask [iceberg-rust]
sdd commented on PR #660: URL: https://github.com/apache/iceberg-rust/pull/660#issuecomment-2406660124 Looks like you just need a rebase on this to get your Arrow 53.1 fix in :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] RecordBatchTransformer: Handle schema migration and column re-ordering in table scans [iceberg-rust]
sdd commented on PR #602: URL: https://github.com/apache/iceberg-rust/pull/602#issuecomment-2406669061 @Xuanwo PTAL - you approved an earlier version but there are some small additional changes since then. I added: * a performance improvement for a particular scenario; * a change to how schemas are compared for equality for this specific case. (See here: https://github.com/apache/iceberg-rust/pull/602/commits/0ed57212fcff716ea500f3b3e55e14a7ef2b87df) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add REST Catalog tests to Spark 3.5 integration test [iceberg]
nastra commented on code in PR #11093: URL: https://github.com/apache/iceberg/pull/11093#discussion_r1796507653 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java: ## @@ -18,8 +18,11 @@ */ package org.apache.iceberg.spark.extensions; +import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE; +import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE_REST; import static org.apache.iceberg.types.Types.NestedField.optional; import static org.assertj.core.api.Assertions.assertThat; +import static org.junit.jupiter.api.Assumptions.assumeFalse; Review Comment: please use AssertJ assumptions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
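A minimal example of the AssertJ-style assumption being requested, assuming the test guards on the configured catalog type (the helper class and its argument are hypothetical, not code from the PR):

```java
import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE;
import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE_REST;
import static org.assertj.core.api.Assumptions.assumeThat;

import java.util.Map;

class RestCatalogAssumptionExample {
  // Skips the calling test when the configured catalog type is REST,
  // using AssertJ assumptions instead of JUnit's assumeFalse.
  static void skipIfRestCatalog(Map<String, String> catalogConfig) {
    assumeThat(catalogConfig.get(ICEBERG_CATALOG_TYPE)).isNotEqualTo(ICEBERG_CATALOG_TYPE_REST);
  }
}
```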
Re: [I] Before expiring snapshots is there need to provide history snapshot file statistics [iceberg]
MichaelDeSteven commented on issue #11213: URL: https://github.com/apache/iceberg/issues/11213#issuecomment-2406495376 > It might be valuable to add a `dry_run` option to `expire_snapshots` like `remove_orphan_files`. Good idea! When `dry_run` is true, don't actually remove files but return the info instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796385502 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: I see this was discussed in https://github.com/apache/iceberg/pull/11238#discussion_r1794250448 IMO, I think trying to converge the specs by taking slightly questionable designs from Delta lake and propagating them to Iceberg seems like not a great choice. It makes more sense to me to call the existing Delta legacy, and converge onto a representation that both can share. It seems the core payload is easy enough to transfer and recompute CRC as new data is written. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
RussellSpitzer commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796153032 ## format/puffin-spec.md: ## @@ -123,6 +123,49 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions, but is optimized for cases where +most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. +64-bit positions are divided into a 32-bit "key" using the most significant 4 +bytes and a 32-bit sub-position using the least significant 4 bytes. For each +key in the set of positions, a 32-bit Roaring bitmap is maintained to store a +set of 32-bit sub-positions for that key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, little-endian Review Comment: So we will be specifically storing a "delta length" which is the real length - 4 bytes? But in manifests we will keep the full blob length including the crc bytes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] chore(deps): bump typos crate to 1.25.0 [iceberg-rust]
Xuanwo merged PR #662: URL: https://github.com/apache/iceberg-rust/pull/662 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796159961 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. Review Comment: What existing deletion vectors? Isn't this new :) If this is specifically for compatibility with other table formats we should likely make it explicit -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
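Whatever formats the note ends up naming, the mixed endianness is easy to get wrong, so here is a hedged sketch of the framing described in the quoted text. It treats the Roaring "portable" payload as an opaque byte array and only shows the big-endian length and CRC fields around it; whether the length field also counts the magic bytes is still an open question elsewhere in this thread, so that line is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class DeleteVectorBlobWriterSketch {
  private static final byte[] MAGIC = new byte[] {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // serializedBitmap is the Roaring "portable" serialization (little-endian internally);
  // how it is produced (e.g. by the RoaringBitmap library) is out of scope for this sketch.
  static byte[] writeBlob(byte[] serializedBitmap) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream data = new DataOutputStream(out);

    // DataOutputStream.writeInt is big-endian, matching the proposed length and CRC encoding.
    // Assumption: the length field covers the magic bytes plus the serialized bitmap,
    // which is one of the interpretations being discussed in this thread.
    data.writeInt(MAGIC.length + serializedBitmap.length);
    data.write(MAGIC);
    data.write(serializedBitmap);

    CRC32 crc = new CRC32();
    crc.update(serializedBitmap);   // checksum of the serialized vector, per the quoted text
    data.writeInt((int) crc.getValue());

    return out.toByteArray();
  }
}
```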
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796204011 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian Review Comment: It sounds like this is being used for compatibility with some existing bitmap, but my impression (haven't researched this in a while) is that CRC-32C (Castagnoli) would be the better choice, as it has machine-level instructions for computation (probably not a huge deal for individual use cases, but saving CPU cycles can't hurt). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
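For reference, both checksum variants ship with the JDK (`java.util.zip.CRC32`, and `java.util.zip.CRC32C` since Java 9), so the choice is mostly a spec/compatibility decision rather than an implementation burden. A minimal comparison sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

class ChecksumComparison {
  static long checksum(Checksum checksum, byte[] bytes) {
    checksum.update(bytes, 0, bytes.length);
    return checksum.getValue();
  }

  public static void main(String[] args) {
    byte[] payload = "serialized delete vector bytes".getBytes(StandardCharsets.UTF_8);
    // CRC-32 (zlib polynomial) vs CRC-32C (Castagnoli polynomial, hardware-accelerated on
    // most modern CPUs); the two produce different values for the same input.
    System.out.println("CRC-32 : " + checksum(new CRC32(), payload));
    System.out.println("CRC-32C: " + checksum(new CRC32C(), payload));
  }
}
```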
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796216340 ## format/puffin-spec.md: ## @@ -123,6 +123,44 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector that represents the positions of rows in a file that +are deleted. A set bit at position P indicates that the row at position P is +deleted. + +The bitmap supports positive 64-bit positions, but is optimized for cases where +most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. +64-bit positions are divided into a 32-bit "key" using the most significant 4 +bytes and a 32-bit position using the least significant 4 bytes. For each key +in the set of positions, a 32-bit Roaring bitmap is maintained to store a set +of 32-bit positions for that key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes are +tested for inclusion in the bitmap. If a bitmap is not found for the key, then +it is not set. + +The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377 +stored as 4 bytes, little-endian). Following the magic bytes is the serialized +collection of bitmaps. The collection is stored using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies to +* `cardinality`: number of deleted rows (set positions) in the delete vector Review Comment: I think the BlobMetadata section requires refactoring as it currently calls out both snapshot and sequence number as required fields. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796202036 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: Review Comment: nit: it might make sense to call out that the keys here are the leading 32 bits of the 64-bit elements so readers don't need to reference the Roaring spec for this detail. Also, since this is for a delete vector, it might be sufficient to limit the key range to the signed range to make Java programmers' lives easier (I don't think in practice we really need the extra bits to have the full uint64 range for position deletes). A small illustration of this follows below. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
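A small illustration of the Java ergonomics point: if keys may use the full uint32 range, every comparison and widening has to go through the unsigned helper methods, whereas keys restricted to the signed (non-negative) range work with plain `int` operations.

```java
class UnsignedKeyExample {
  public static void main(String[] args) {
    int keyA = 5;
    int keyB = 0x80000000;  // a "negative" int if keys may use the full uint32 range

    // Signed comparison gets the unsigned key ordering wrong...
    System.out.println(Integer.compare(keyA, keyB));          // > 0, i.e. keyA sorts after keyB
    // ...so unsigned keys force the use of the unsigned helpers everywhere.
    System.out.println(Integer.compareUnsigned(keyA, keyB));  // < 0, the intended order
    System.out.println(Integer.toUnsignedLong(keyB));         // 2147483648
  }
}
```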
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796220782 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian Review Comment: What is length here? Is it potentially already captured by metadata (e.g. offset/length) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796221840 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies Review Comment: Would it make sense to include min row and max row in the delete vector here (we can avoid doing any more work if the row-group being processed is outside this range). This might also be cheap enough to extract from the data itself, so it might be a non-issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
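On the "cheap enough to extract from the data itself" point: once the per-key bitmaps are in memory, the minimum and maximum deleted positions fall out of the first/last key and that key's bitmap, assuming the `org.roaringbitmap` library's `first()`/`last()` methods. A sketch follows; the surrounding map structure is hypothetical, and keys are assumed non-negative so the natural ordering matches the spec's unsigned ordering.

```java
import java.util.SortedMap;
import org.roaringbitmap.RoaringBitmap;

class DeleteVectorRange {
  // Assumes a non-empty map and keys restricted to the non-negative int range.
  static long minDeletedPosition(SortedMap<Integer, RoaringBitmap> bitmaps) {
    int key = bitmaps.firstKey();
    return ((long) key << 32) | Integer.toUnsignedLong(bitmaps.get(key).first());
  }

  static long maxDeletedPosition(SortedMap<Integer, RoaringBitmap> bitmaps) {
    int key = bitmaps.lastKey();
    return ((long) key << 32) | Integer.toUnsignedLong(bitmaps.get(key).last());
  }
}
```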
Re: [I] Iceberg streaming using checkpoint does not ignore the stream-from-timestamp option [iceberg]
github-actions[bot] closed issue #8921: Iceberg streaming using checkpoint does not ignore the stream-from-timestamp option URL: https://github.com/apache/iceberg/issues/8921 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg streaming using checkpoint does not ignore the stream-from-timestamp option [iceberg]
github-actions[bot] commented on issue #8921: URL: https://github.com/apache/iceberg/issues/8921#issuecomment-2406263894 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" [iceberg]
github-actions[bot] commented on issue #8926: URL: https://github.com/apache/iceberg/issues/8926#issuecomment-2406263920 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark write abort result in table miss metadata location file [iceberg]
github-actions[bot] commented on issue #8927: URL: https://github.com/apache/iceberg/issues/8927#issuecomment-2406263940 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark write abort result in table miss metadata location file [iceberg]
github-actions[bot] closed issue #8927: Spark write abort result in table miss metadata location file URL: https://github.com/apache/iceberg/issues/8927 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Missing serialVersionUID in Serializable implementation [iceberg]
github-actions[bot] commented on issue #8929: URL: https://github.com/apache/iceberg/issues/8929#issuecomment-2406263958 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Missing serialVersionUID in Serializable implementation [iceberg]
github-actions[bot] closed issue #8929: Missing serialVersionUID in Serializable implementation URL: https://github.com/apache/iceberg/issues/8929 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Vulnerabilities found on latest version - jackson, avro, openssl [iceberg]
github-actions[bot] closed issue #8923: Vulnerabilities found on latest version - jackson, avro, openssl URL: https://github.com/apache/iceberg/issues/8923 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" [iceberg]
github-actions[bot] closed issue #8926: DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" URL: https://github.com/apache/iceberg/issues/8926 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Vulnerabilities found on latest version - jackson, avro, openssl [iceberg]
github-actions[bot] commented on issue #8923: URL: https://github.com/apache/iceberg/issues/8923#issuecomment-2406263904 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Feature: S3 Remote Signing [iceberg-rust]
Xuanwo commented on issue #506: URL: https://github.com/apache/iceberg-rust/issues/506#issuecomment-2404943472 Hi, @flaneur2020. Thank you for your work, and sorry for the delay with reqsign and opendal. I hope we can integrate them in a more organized way, which will lead to a significant refactor of the entire reqsign codebase. I'll ping you here once it's ready. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]
laithalzyoud commented on PR #10920: URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2404869540 @szehon-ho PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Table Not Found While reading IcebergTable from Spark SQL [iceberg]
AwasthiSomesh opened a new issue, #11297: URL: https://github.com/apache/iceberg/issues/11297 ### Apache Iceberg version 1.6.1 (latest release) ### Query engine Spark ### Please describe the bug 🐞 Hi Team, I have set up the Hive 4 Docker images, but when reading a table from Spark SQL it throws a "table not found" exception. Storage is Gen2 and the Hive metastore is Derby/PostgreSQL; reading directly from the table location works properly. Thanks, Somesh ### Willingness to contribute - [ ] I can contribute a fix for this bug independently - [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community - [ ] I cannot contribute a fix for this bug at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
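From the description alone it is hard to tell what is misconfigured, but "table not found" from Spark SQL while direct path-based reads work is usually a catalog configuration problem. A hedged sketch of wiring Spark to an Iceberg catalog backed by a Hive metastore; the catalog name, metastore URI, and table identifier below are placeholders, not values from this report.

```java
import org.apache.spark.sql.SparkSession;

public class IcebergHiveCatalogExample {
  public static void main(String[] args) {
    // Catalog name ("hive_cat"), metastore URI, and table name are placeholders.
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-hive-catalog-check")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.hive_cat.type", "hive")
        .config("spark.sql.catalog.hive_cat.uri", "thrift://hive-metastore:9083")
        .getOrCreate();

    // The table must be referenced through the configured catalog; querying it through
    // Spark's default session catalog will report it as not found.
    spark.sql("SELECT * FROM hive_cat.db.my_table").show();
  }
}
```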
Re: [I] Support commit retries [iceberg-python]
mark-major commented on issue #269: URL: https://github.com/apache/iceberg-python/issues/269#issuecomment-2405076085 @maxlucuta Yes, that's what I have been using. It would be nice if there were an internal retry for the commit so the client application doesn't have to be polluted with retry logic. The difficulty of pushing the retry down to the library level is handling the local table metadata object: the commit would succeed, but the metadata object would point to a whole different snapshot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: Adds Row Lineage [iceberg]
nastra commented on code in PR #11130: URL: https://github.com/apache/iceberg/pull/11130#discussion_r1795454308 ## format/spec.md: ## @@ -298,16 +298,101 @@ Iceberg tables must not use field ids greater than 2147483447 (`Integer.MAX_VALU The set of metadata columns is: -| Field id, name | Type | Description | -|-|---|-| -| **`2147483646 _file`** | `string` | Path of the file in which a row is stored | -| **`2147483645 _pos`** | `long`| Ordinal position of a row in the source data file | -| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted | -| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row | -| **`2147483642 _partition`** | `struct` | Partition to which a row belongs | -| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files | -| **`2147483545 pos`** | `long`| Ordinal position of a row, used in position-based delete files | -| **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files | +| Field id, name | Type | Description | +|--|---|-| +| **`2147483646 _file`** | `string` | Path of the file in which a row is stored | +| **`2147483645 _pos`** | `long`| Ordinal position of a row in the source data file, starting at `0` | +| **`2147483644 _deleted`** | `boolean` | Whether the row has been deleted| +| **`2147483643 _spec_id`** | `int` | Spec ID used to track the file containing a row | +| **`2147483642 _partition`** | `struct` | Partition to which a row belongs| +| **`2147483546 file_path`** | `string` | Path of a file, used in position-based delete files | +| **`2147483545 pos`**| `long`| Ordinal position of a row, used in position-based delete files | +| **`2147483544 row`**| `struct<...>` | Deleted row values, used in position-based delete files | +| **`2147483540 _row_id`**| `long`| A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) | Review Comment: ```suggestion | **`2147483540 _row_id`**| `long`| A unique long assigned when row-lineage is enabled, see [Row Lineage](#row-lineage) | ``` ## format/spec.md: ## @@ -488,7 +573,6 @@ The `partition` struct stores the tuple of partition values for each file. Its t The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan. - Review Comment: nit: unrelated change? ## format/spec.md: ## @@ -684,34 +796,38 @@ The atomic operation used to commit metadata depends on how tables are tracked a Table metadata consists of the following fields: -| v1 | v2 | Field | Description | -| -- | -- | - | --- | -| _required_ | _required_ | **`format-version`** | An integer version number for the format. Currently, this can be 1 or 2 based on the spec. Implementations must throw an exception if a table's version is higher than the supported version. | -| _optional_ | _required_ | **`table-uuid`** | A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | -| _required_ | _required_ | **`location`**| The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. 
| -|| _required_ | **`last-sequence-number`**| The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. | -| _required_ | _required_ | **`last-updated-ms`**| Timestamp in milliseconds from the unix epoch when the table was
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796158658 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` Review Comment: why is a magic number here if this is already versioned by the blob type? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796200502 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] Review Comment: I skimmed through the general layout and I might have missed it but it doesn't seem to include the size of the actual roaring bitmap, this seems like an important detail either way so it is probably worth calling out how clients can get the size of the bitmap (to skip over it) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
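A hedged sketch of how a reader can cope with that, assuming the `org.roaringbitmap` Java library: `deserialize(DataInput)` consumes exactly one bitmap by parsing the bitmap's own header, and `serializedSizeInBytes()` on the result reports how many bytes that was, which lets a buffer-based reader advance its offset.

```java
import java.io.DataInputStream;
import java.io.IOException;
import org.roaringbitmap.RoaringBitmap;

class BitmapSizeExample {
  // Reads one 32-bit Roaring bitmap from the stream. The general layout does not carry an
  // explicit size prefix; the reader learns the size by parsing the bitmap's own header,
  // which deserialize(DataInput) does internally.
  static RoaringBitmap readOne(DataInputStream in) throws IOException {
    RoaringBitmap bitmap = new RoaringBitmap();
    bitmap.deserialize(in);
    // serializedSizeInBytes() reports how many bytes this bitmap occupies when serialized,
    // useful for advancing an offset when reading from a byte buffer instead of a stream.
    int consumed = bitmap.serializedSizeInBytes();
    System.out.println("bitmap occupied " + consumed + " bytes");
    return bitmap;
  }
}
```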
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
emkornfield commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796210980 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: Review Comment: rereading this is described above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add scan planning api models and parsers [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1796287385 ## core/src/main/java/org/apache/iceberg/rest/responses/PlanTableScanResponse.java: ## @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest.responses; + +import java.util.List; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.rest.PlanStatus; +import org.apache.iceberg.rest.RESTResponse; + +public class PlanTableScanResponse implements RESTResponse { + private PlanStatus planStatus; + private String planId; + private List planTasks; + private List fileScanTasks; + private List deleteFiles; + + public PlanTableScanResponse() { +// Needed for Jackson Deserialization. + } + + public PlanTableScanResponse( + PlanStatus planStatus, + String planId, + List planTasks, + List fileScanTasks, + List deleteFiles) { +this.planStatus = planStatus; +this.planId = planId; +this.planTasks = planTasks; +this.fileScanTasks = fileScanTasks; +this.deleteFiles = deleteFiles; + } + + public PlanStatus planStatus() { +return planStatus; + } + + public String planId() { +return planId; + } + + public List planTasks() { +return planTasks; + } + + public List fileScanTasks() { +return fileScanTasks; + } + + public List deleteFiles() { +return deleteFiles; + } + + @Override + public void validate() { +Preconditions.checkArgument(planStatus() != null, "invalid response, status can not be null"); +Preconditions.checkArgument( +planStatus() == PlanStatus.SUBMITTED && planId != null, +"Invalid response: planId to be non-null when status is 'submitted"); +Preconditions.checkArgument( +planStatus() != PlanStatus.CANCELLED, +"Invalid response: 'cancelled' is not a valid status for planTableScan"); +Preconditions.checkArgument( +planStatus() != PlanStatus.COMPLETED && (planTasks() != null || fileScanTasks() != null), +"Invalid response: tasks can only be returned in a 'completed' status"); +Preconditions.checkArgument( +deleteFiles() != null && fileScanTasks() == null, +"Invalid response: deleteFiles should only be returned with fileScanTasks that reference them"); + } + + public static class Builder { +public Builder() {} + +private PlanStatus planStatus; +private String planId; +private List planTasks; +private List fileScanTasks; +private List deleteFiles; + +public Builder withPlanStatus(PlanStatus withPlanStatus) { + this.planStatus = withPlanStatus; + return this; +} + +public Builder withPlanId(String withPlanId) { + this.planId = withPlanId; + return this; +} + +public Builder withPlanTasks(List withPlanTasks) { + this.planTasks = withPlanTasks; + return this; +} 
+ +public Builder withFileScanTasks(List withFileScanTasks) { + this.fileScanTasks = withFileScanTasks; + return this; +} + +public Builder withDeleteFiles(List withDeleteFiles) { Review Comment: This may have been done due to checkstyle complaining about hidden fields, my bad. It'd be better if we could have different names -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add scan planning api models and parsers [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1796421177 ## core/src/main/java/org/apache/iceberg/rest/RESTFileScanTaskParser.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.util.List; +import org.apache.iceberg.BaseFileScanTask; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.ExpressionParser; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.expressions.ResidualEvaluator; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; + +public class RESTFileScanTaskParser { + private static final String DATA_FILE = "data-file"; + private static final String DELETE_FILE_REFERENCES = "delete-file-references"; + private static final String RESIDUAL = "residual-filter"; + + private RESTFileScanTaskParser() {} + + public static void toJson( + FileScanTask fileScanTask, List deleteFiles, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(fileScanTask != null, "Invalid file scan task: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); +generator.writeFieldName(DATA_FILE); +RESTContentFileParser.toJson(fileScanTask.file(), generator); + +// TODO revisit this logic +if (deleteFiles != null) { + generator.writeArrayFieldStart(DELETE_FILE_REFERENCES); + for (int delIndex = 0; delIndex < deleteFiles.size(); delIndex++) { +generator.writeNumber(delIndex); + } + generator.writeEndArray(); +} +if (fileScanTask.residual() != null) { + generator.writeFieldName(RESIDUAL); + ExpressionParser.toJson(fileScanTask.residual(), generator); +} +generator.writeEndObject(); + } + + public static FileScanTask fromJson(JsonNode jsonNode, List allDeleteFiles) { +Preconditions.checkArgument(jsonNode != null, "Invalid JSON node for file scan task: null"); +Preconditions.checkArgument( +jsonNode.isObject(), "Invalid JSON node for file scan task: non-object (%s)", jsonNode); + +DataFile dataFile = (DataFile) RESTContentFileParser.fromJson(jsonNode.get(DATA_FILE)); + +DeleteFile[] matchedDeleteFiles = null; +List deleteFileReferences = null; +if (jsonNode.has(DELETE_FILE_REFERENCES)) { + ImmutableList.Builder deleteFileReferencesBuilder = ImmutableList.builder(); + JsonNode deletesArray = jsonNode.get(DELETE_FILE_REFERENCES); + for 
(JsonNode deleteRef : deletesArray) { +deleteFileReferencesBuilder.add(deleteRef); + } + deleteFileReferences = deleteFileReferencesBuilder.build(); +} + +if (deleteFileReferences != null) { + ImmutableList.Builder matchedDeleteFilesBuilder = ImmutableList.builder(); + for (Integer deleteFileIdx : deleteFileReferences) { +matchedDeleteFilesBuilder.add(allDeleteFiles.get(deleteFileIdx)); + } + matchedDeleteFiles = (DeleteFile[]) matchedDeleteFilesBuilder.build().stream().toArray(); +} + +// TODO revisit this in spec +Expression filter = Expressions.alwaysTrue(); +if (jsonNode.has(RESIDUAL)) { + filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL)); +} + +ResidualEvaluator residualEvaluator = ResidualEvaluator.of(filter); + +// TODO at the time of creation we dont have the schemaString and specString so can we avoid +// setting this +// will need to refresh before returning closed iterable of tasks, for now put place holder null +BaseFileScanTask baseFileScanTask = +new BaseFileScanTask(dataFile, matchedDeleteFiles, null, null, residualEvaluator); Review Comment: we can introduce a `UnboundFileScanTask` which extends BaseFileSca
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796422257 ## format/spec.md: ## @@ -51,6 +51,7 @@ Version 3 of the Iceberg spec extends data types and existing metadata structure * New data types: nanosecond timestamp(tz) * Default value support for columns * Multi-argument transforms for partitioning and sorting +* Binary deletion vectors Review Comment: The Puffin spec says `delete vector`, we say `deletion vector` here. Which form is more accurate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on PR #11238: URL: https://github.com/apache/iceberg/pull/11238#issuecomment-2406552680 I am going to share a PR with some basic implementation that follows this spec. We can use it as an example that will hopefully clarify some questions. Thanks for putting this together, @rdblue! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796423298 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | Review Comment: I think we
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1796424254 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type Review Comment: We need to mention that these are uncompressed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796424825 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | _
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
aokolnychyi commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1796426122 ## format/spec.md: ## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name| Type | Description | -| -- | -- |---|--|-| -|| _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`**| `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ || ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ || ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | -| _optional_ || ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`**| `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -|| _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. 
Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | v3 | Field id, name| Type | Description | +| -- | -- | -- |---|--|-| +|| _required_ | _required_ | **`134 content`**| `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | _
Re: [I] Feature: S3 Remote Signing [iceberg-rust]
flaneur2020 commented on issue #506: URL: https://github.com/apache/iceberg-rust/issues/506#issuecomment-2404718870 The PR that allows passing a `Sign` trait has already been merged into reqsign. However, some work is still needed to integrate it into opendal, so it may take some more time before this issue is ready to be worked on. Sorry, I'll unassign myself from this issue for the moment 🥲; I'll keep an eye on the progress and update this issue when it's ready to be tackled again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Tests alignment for the Flink Sink v2-based implementation (IcebergSink) [iceberg]
arkadius commented on PR #11219: URL: https://github.com/apache/iceberg/pull/11219#issuecomment-2404641938 > Hi @arkadius I have started working in backporting the RANGE distribution to the IcebergSink. The unit tests in my code will benefit from the new marker interface you are introducing in this PR, so I'd like to merge this one so that I can rebase properly. I see this PR is still on "draft". Is anything pending? > > @pvary does this PR look good to you, or you have anything pending that would like to be done? The PR is looking good to me. > > Also, please @pvary and @stevenzwu please help starting the CI pipelines on this PR. @rodmeneses I think that it would be better if you copied the parts of the code from this PR into your PR. I can rebase onto your changes after you merge them. This PR isn't finished yet. I have a plan for how the final shape of the unit tests could look, to make them easier to maintain and to fill some gaps in `IcebergSink`'s tests, but before I show it, I would like to merge this PR: #11249 (and maybe this one: #11244). Thanks to that, it will be clearer what the scope of this change is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]
nastra commented on code in PR #10722: URL: https://github.com/apache/iceberg/pull/10722#discussion_r1795062381 ## open-api/rest-catalog-open-api.yaml: ## @@ -3103,6 +3103,81 @@ components: uuid: type: string +ADLSCredentials: + type: object + allOf: +- $ref: '#/components/schemas/Credentials' + required: +- type + properties: +type: + type: string + enum: [ "adls" ] +account-name: + type: string +account-key: + type: string +sas-token: + type: string +expires-at-ms: + type: integer + format: int64 + +GCSCredentials: + type: object + allOf: +- $ref: '#/components/schemas/Credentials' + required: +- type +- token +- expires-at-ms + properties: +type: + type: string + enum: [ "gcs" ] +token: + type: string +expires-at-ms: + type: integer + format: int64 + +S3Credentials: + type: object + allOf: +- $ref: '#/components/schemas/Credentials' + required: Review Comment: based on recent discussions the feedback was that we don't want to have anything storage-specific in the OpenAPI spec (other than documenting the different storage configurations, which is handled by https://github.com/apache/iceberg/pull/10576). Therefore I've updated the PR and made it flexible enough so that we could still pass different credentials based on a given `prefix`. That should hopefully work for everyone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: Add expiry time config to REST table load [iceberg]
nastra commented on code in PR #10873: URL: https://github.com/apache/iceberg/pull/10873#discussion_r1795610952 ## open-api/rest-catalog-open-api.py: ## @@ -1103,6 +1103,7 @@ class LoadTableResult(BaseModel): ## General Configurations - `token`: Authorization bearer token to use for table requests if OAuth2 security is enabled +- `expires-at-ms`: if present, specifies timestamp in milliseconds when credentials for storage(AWS S3/Azure/GCP), specified in config would expire Review Comment: I don't think it's actually correct to have this at this level. This should be specific to the underlying Storage provider. GCS for example has `gcs.oauth2.token-expires-at`. That being said, the respective FileIO properties of each storage provider should have a similar property to the one used for GCS -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
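To make the comparison concrete, here is a minimal sketch of how the GCS-style properties could surface in a load-table config map; `gcs.oauth2.token` and `gcs.oauth2.token-expires-at` are the existing GCS FileIO property names, while the token value and timestamp are made up, and S3/ADLS would need their own analogous property names:

```java
import java.util.Map;

class StorageSpecificExpirationExample {
  public static void main(String[] args) {
    // Illustrative values only: the expiry is carried as a storage-specific
    // property rather than a single top-level expires-at-ms key.
    Map<String, String> config =
        Map.of(
            "gcs.oauth2.token", "ya29.example-token",
            "gcs.oauth2.token-expires-at", "1728500000000"); // expiry as epoch milliseconds

    System.out.println(config);
  }
}
```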
Re: [PR] Spec: add variant type [iceberg]
sfc-gh-aixu commented on code in PR #10831: URL: https://github.com/apache/iceberg/pull/10831#discussion_r1795791601 ## format/spec.md: ## @@ -178,6 +178,8 @@ A **`list`** is a collection of values with some element type. The element field A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types. +A **`variant`** is a type to represent semi-structured data. A variant value can store a value of any other type, including any primitive, struct, list or map values. The variant value is encoded in its own binary [encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md). Variant type is added in [v3](#version-3). Review Comment: Updated. Currently I'm linking to the commit which merges the spec. We can update to a released Parquet branch later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add REST Catalog tests to Spark 3.5 integration test [iceberg]
haizhou-zhao commented on code in PR #11093: URL: https://github.com/apache/iceberg/pull/11093#discussion_r1795891887 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestBaseWithCatalog.java: ## @@ -59,18 +71,47 @@ protected static Object[][] parameters() { } @BeforeAll - public static void createWarehouse() throws IOException { + public static void createWarehouseAndStartRest() throws IOException { Review Comment: ack -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add REST Catalog tests to Spark 3.5 integration test [iceberg]
haizhou-zhao commented on code in PR #11093: URL: https://github.com/apache/iceberg/pull/11093#discussion_r1795890640 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java: ## @@ -740,6 +743,11 @@ private boolean partitionMatch(Record file, String partValue) { @TestTemplate public void metadataLogEntriesAfterReplacingTable() throws Exception { +if (Set.of(ICEBERG_CATALOG_TYPE_REST).contains(catalogConfig.get(ICEBERG_CATALOG_TYPE))) { Review Comment: thx, replacing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add REST Catalog tests to Spark 3.5 integration test [iceberg]
haizhou-zhao commented on code in PR #11093: URL: https://github.com/apache/iceberg/pull/11093#discussion_r1795881502 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java: ## @@ -521,7 +524,7 @@ public void testFilesTableTimeTravelWithSchemaEvolution() throws Exception { optional(3, "category", Types.StringType.get(; spark.createDataFrame(newRecords, newSparkSchema).coalesce(1).writeTo(tableName).append(); - +table.refresh(); Review Comment: Yes, RESTTableOperations and other TableOperations have different mechanisms for refreshing metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Handling NO Coordinator Scenario and Data Loss in the current Design [iceberg]
kumarpritam863 opened a new pull request, #11298: URL: https://github.com/apache/iceberg/pull/11298 This PR combines the issues I have already handled in these two PRs: 1. **No Coordinator Scenario in ICR Mode -> https://github.com/apache/iceberg/pull/11288** 2. **Data Loss in ICR Mode -> https://github.com/apache/iceberg/pull/11289** With this PR, I have made a few changes to handle both issues in a more **robust**, **resilient**, and **performant** way. I have another design-change proposal in this **[PR](https://github.com/apache/iceberg/pull/11290)**, which handles many more scenarios, without any offset sync or other extra effort, in a much more resilient way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add REST Catalog tests to Spark 3.5 integration test [iceberg]
haizhou-zhao commented on code in PR #11093: URL: https://github.com/apache/iceberg/pull/11093#discussion_r1795896887 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestBaseWithCatalog.java: ## @@ -59,18 +71,47 @@ protected static Object[][] parameters() { } @BeforeAll - public static void createWarehouse() throws IOException { + public static void createWarehouseAndStartRest() throws IOException { TestBaseWithCatalog.warehouse = File.createTempFile("warehouse", null); assertThat(warehouse.delete()).isTrue(); +try { Review Comment: ack -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Handling NO Coordinator Scenario and Data Loss in the current Design [iceberg]
kumarpritam863 commented on PR #11298: URL: https://github.com/apache/iceberg/pull/11298#issuecomment-2405762502 @bryanck could you please review this? This PR handles most of the ICR issues within the current design itself, with slight modifications. We can discuss more on this [PR](https://github.com/apache/iceberg/pull/11290), which proposes a more robust, resilient, and Connect-friendly design. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add scan planning api models and parsers [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1795908515 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
Re: [PR] Table Scan Delete File Handling: Positional Delete Support [iceberg-rust]
sdd commented on PR #652: URL: https://github.com/apache/iceberg-rust/pull/652#issuecomment-2405778274 @Xuanwo, @liurenjie1024: This is now ready to review, PTAL when you guys get chance. Look forward to your feedback 😁 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Arrow: Fix indexing in Parquet dictionary encoded values readers [iceberg]
wypoon commented on code in PR #11247: URL: https://github.com/apache/iceberg/pull/11247#discussion_r1795958912 ## spark/v3.5/spark/src/test/resources/decimal_dict_and_plain_encoding.parquet: ## Review Comment: I also tried ``` try (FileAppender writer = Parquet.write(Files.localOutput(parquetFile)) .schema(schema) .createWriterFunc(ParquetAvroWriter::buildWriter) .set(PARQUET_DICT_SIZE_BYTES, "2048") .set(PARQUET_PAGE_ROW_LIMIT, "100") .build()) { writer.addAll(records); } ``` and stepped through the `writer` calls in a debugger. The following chain is called to get the `ValuesWriter` for the decimal column: https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L84 https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L167 https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L52 https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L55 https://github.com/apache/parquet-java/blob/apache-parquet-1.13.1/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L80 which is the code I cited previously: ``` private ValuesWriter getFixedLenByteArrayValuesWriter(ColumnDescriptor path) { // dictionary encoding was not enabled in PARQUET 1.0 return new FixedLenByteArrayPlainValuesWriter(path.getTypeLength(), parquetProperties.getInitialSlabSize(), parquetProperties.getPageSizeThreshold(), parquetProperties.getAllocator()); } ``` I am using the `iceberg-parquet` APIs to create a writer, but eventually it calls `parquet-java` code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
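As a quick cross-check, the encodings that actually ended up in the test file can be read back from its footer; a small sketch using parquet-java's metadata API (the file path argument is a placeholder, and this inspects the file directly rather than going through Iceberg):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

class PrintColumnEncodings {
  public static void main(String[] args) throws Exception {
    // args[0]: path to a Parquet file, e.g. decimal_dict_and_plain_encoding.parquet
    try (ParquetFileReader reader =
        ParquetFileReader.open(
            HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      reader.getFooter().getBlocks().forEach(block ->
          block.getColumns().forEach(column ->
              // Prints the column path and the set of encodings used in that chunk,
              // which shows whether the decimal column was dictionary- or plain-encoded.
              System.out.println(column.getPath() + " -> " + column.getEncodings())));
    }
  }
}
```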
Re: [PR] Support wasb[s] paths in ADLSFileIO [iceberg]
RussellSpitzer commented on code in PR #11294: URL: https://github.com/apache/iceberg/pull/11294#discussion_r1795970799 ## azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java: ## @@ -18,24 +18,34 @@ */ package org.apache.iceberg.azure.adlsv2; +import java.net.URI; +import java.net.URISyntaxException; import java.util.Optional; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.iceberg.exceptions.ValidationException; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; /** - * This class represents a fully qualified location in Azure expressed as a URI. + * This class represents a fully qualified location in Azure Data Lake Storage, expressed as a URI. * * Locations follow the conventions used by Hadoop's Azure support, i.e. * - * {@code abfs[s]://[@]/} + * {@code abfs[s]://[@].dfs.core.windows.net/} Review Comment: Is this a change? or is Storage Account Host equivalent to `storageAccount.dfs.core.windows.net` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Support wasb[s] paths in ADLSFileIO [iceberg]
RussellSpitzer commented on code in PR #11294: URL: https://github.com/apache/iceberg/pull/11294#discussion_r1795977105 ## azure/src/main/java/org/apache/iceberg/azure/adlsv2/ADLSLocation.java: ## @@ -53,19 +63,17 @@ class ADLSLocation { ValidationException.check(matcher.matches(), "Invalid ADLS URI: %s", location); -String authority = matcher.group(1); -String[] parts = authority.split("@", -1); -if (parts.length > 1) { - this.container = parts[0]; - this.storageAccount = parts[1]; -} else { - this.container = null; - this.storageAccount = authority; +try { + URI uri = new URI(location); + this.container = uri.getUserInfo(); + // storage account name is the first part of the host + int accountSplit = uri.getHost().indexOf('.'); + String storageAccountName = uri.getHost().substring(0, accountSplit); + this.storageAccount = String.format("%s.dfs.core.windows.net", storageAccountName); + this.path = uri.getPath().length() > 1 ? uri.getRawPath().substring(1) : ""; +} catch (URISyntaxException e) { + throw new ValidationException("Invalid URI: %s", location); Review Comment: This is validation we didn't have before correct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
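For reference, a minimal sketch of the parsing described above applied to a `wasbs` URI using plain `java.net.URI`; the container, account, and path values are made up:

```java
import java.net.URI;

class AdlsUriParseExample {
  public static void main(String[] args) throws Exception {
    // Mirrors the logic in the diff: container from the user-info part, storage account
    // from the first host label, path with the leading slash stripped.
    URI uri = new URI("wasbs://my-container@myaccount.blob.core.windows.net/warehouse/db/table");

    String container = uri.getUserInfo();                                          // my-container
    String account = uri.getHost().substring(0, uri.getHost().indexOf('.'));       // myaccount
    String path = uri.getPath().length() > 1 ? uri.getRawPath().substring(1) : ""; // warehouse/db/table

    System.out.printf("container=%s account=%s path=%s%n", container, account, path);
  }
}
```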
Re: [I] Spark, Flink: Test failures after updating JUnit 5.10 to 5.11 [iceberg]
tomtongue commented on issue #11296: URL: https://github.com/apache/iceberg/issues/11296#issuecomment-2404426076 `iceberg-core` and `iceberg-mr` also have this issue; however, I created the PR (https://github.com/apache/iceberg/pull/11262) to fix it there. For Spark and Flink, there are a lot of tests that need to be fixed, so I'm adding `junit.platform.reflection.search.useLegacySemantics` and will make all test cases in Spark and Flink compatible with JUnit 5.11. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org