ggershinsky commented on code in PR #16527:
URL: https://github.com/apache/iceberg/pull/16527#discussion_r3292244790


##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 
 1. The format of encrypted key metadata is determined by the table's 
encryption scheme and can be a wrapped format specific to the table's KMS 
provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key 
material as a binary blob. To enable cross-implementation interoperability, the 
standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. 
Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the 
raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) 
for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, 
or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES 
GCM Stream](gcm-stream-spec.md) integrity protection. |

Review Comment:
   The AAD prefix is used not only in AES GCM Stream files but also in 
encrypted Parquet files; see 
https://parquet.apache.org/docs/file-format/data-pages/encryption/ or 
https://github.com/apache/parquet-format/blob/master/Encryption.md



##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 
 1. The format of encrypted key metadata is determined by the table's 
encryption scheme and can be a wrapped format specific to the table's KMS 
provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key 
material as a binary blob. To enable cross-implementation interoperability, the 
standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. 
Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the 
raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) 
for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, 
or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES 
GCM Stream](gcm-stream-spec.md) integrity protection. |
+| **`file_length`** | `long` | _optional_ | The plaintext file length before 
encryption. Used to detect truncation attacks (see [AES GCM Stream file 
length](gcm-stream-spec.md#file-length)). |

Review Comment:
   This field holds the file length after encryption, not the plaintext length; see 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java#L45
   
   It applies only to AES GCM Stream files. It is not set or used for encrypted Parquet data files.
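   
   To make the proposed layout concrete, here is a minimal sketch of how the 
version byte and the Avro payload could be assembled. It is not the Iceberg 
implementation: the function names are hypothetical, and it assumes each 
optional field is an Avro union `["null", T]` with `null` as branch 0, which 
the proposed text does not state.

```python
from typing import Callable, Optional


def avro_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def avro_bytes(b: bytes) -> bytes:
    """Encode an Avro bytes value: length as a long, followed by the raw bytes."""
    return avro_long(len(b)) + b


def avro_optional(value, encode: Callable) -> bytes:
    """Encode a ["null", T] union. ASSUMPTION: null is branch 0 and T is branch 1."""
    if value is None:
        return avro_long(0)
    return avro_long(1) + encode(value)


def encode_key_metadata(
    encryption_key: bytes,
    aad_prefix: Optional[bytes] = None,
    file_length: Optional[int] = None,
) -> bytes:
    """Hypothetical helper: version byte 0x01 followed by the raw Avro record."""
    if len(encryption_key) not in (16, 24, 32):
        raise ValueError("encryption_key must be 16, 24, or 32 bytes")
    payload = (
        avro_bytes(encryption_key)
        + avro_optional(aad_prefix, avro_bytes)
        + avro_optional(file_length, avro_long)
    )
    return b"\x01" + payload
```

   Whichever length the spec settles on for `file_length`, the encoding of the 
field is the same; only the value written by the caller changes.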



##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 
 1. The format of encrypted key metadata is determined by the table's 
encryption scheme and can be a wrapped format specific to the table's KMS 
provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key 
material as a binary blob. To enable cross-implementation interoperability, the 
standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. 
Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the 
raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) 
for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, 
or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES 
GCM Stream](gcm-stream-spec.md) integrity protection. |
+| **`file_length`** | `long` | _optional_ | The plaintext file length before 
encryption. Used to detect truncation attacks (see [AES GCM Stream file 
length](gcm-stream-spec.md#file-length)). |
+
+The AAD prefix is combined with a 4-byte little-endian block index to form the 
AAD for each AES GCM Stream cipher block, as described in the [AES GCM Stream 
AAD section](gcm-stream-spec.md#additional-authenticated-data).

Review Comment:
   In Parquet encryption, this works differently; see 
https://parquet.apache.org/docs/file-format/data-pages/encryption/
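   
   For comparison, the AES GCM Stream construction described in the proposed 
text can be sketched as below. The function name is hypothetical, and the 
sketch assumes the prefix comes first and the block index is an unsigned 
32-bit value appended after it. It does not cover the Parquet AAD layout.

```python
import struct


def gcm_stream_block_aad(aad_prefix: bytes, block_index: int) -> bytes:
    """Hypothetical helper: AAD for one AES GCM Stream cipher block.

    ASSUMPTION: the AAD is the prefix followed by the block index encoded
    as a 4-byte little-endian unsigned integer.
    """
    return aad_prefix + struct.pack("<I", block_index)
```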



##########
format/spec.md:
##########
@@ -667,7 +664,7 @@ The `data_file` struct consists of the following fields:
     | _optional_ | _optional_ |            | ~~**`111  distinct_counts`**~~    
| `map<123: int, 124: long>`                                                  | 
**Deprecated. Do not write.** |
     | _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           
| `map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2] |
     | _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           
| `map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2] |
-    | _optional_ | _optional_ | _optional_ | **`131  key_metadata`**           
| `binary`                                                                    | 
Implementation-specific key metadata for encryption |
+    | _optional_ | _optional_ | _optional_ | **`131  key_metadata`**           
| `binary`                                                                    | 
Per-file encryption key metadata. See [Standard Key 
Metadata](#standard-key-metadata) for the interoperable format used by the 
standard encryption scheme. |

Review Comment:
   There is also a `key_metadata` field in the Manifest File struct (field id 
519).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

