mbutrovich commented on code in PR #2584:
URL: https://github.com/apache/iceberg-rust/pull/2584#discussion_r3414594167
##########
crates/iceberg/src/arrow/reader/pipeline.rs:
##########
@@ -431,14 +437,44 @@ impl ArrowReader {
)
.with_parquet_read_options(parquet_read_options);
- let arrow_metadata = ArrowReaderMetadata::load_async(&mut reader,
Default::default())
+ let arrow_reader_options =
Self::build_arrow_reader_options(key_metadata)?;
+
+ let arrow_metadata = ArrowReaderMetadata::load_async(&mut reader,
arrow_reader_options)
.await
.map_err(|e| {
Error::new(ErrorKind::Unexpected, "Failed to load Parquet
metadata").with_source(e)
})?;
Ok((reader, arrow_metadata))
}
+
+ /// Builds `ArrowReaderOptions`, adding `FileDecryptionProperties` when
+ /// key metadata is present for Parquet Modular Encryption.
+ fn build_arrow_reader_options(key_metadata: Option<&[u8]>) ->
Result<ArrowReaderOptions> {
Review Comment:
Resolution is hardcoded to `StandardKeyMetadata::decode(km)` -> plaintext
DEK (line 456-458). Worth noting the spec defines `key_metadata` as
"implementation-specific key metadata for encryption" (field 131), and
`StandardKeyMetadata` is the reference implementation's format rather than a
spec-mandated one. So baking in that single decode path means non-standard
key-management schemes (for example, where `key_metadata` is a key reference
resolved from a KMS at read time rather than an embedded DEK) cannot plug in.
Would it be worth keeping the `StandardKeyMetadata` path as the default but
exposing the resolution step as a trait, so other implementations can supply
their own? Roughly:
```rust
#[async_trait]
pub trait FileDecryptionHandler: Send + Sync {
async fn file_decryption_properties(&self, key_metadata: &[u8]) ->
Result<FileDecryptionProperties>;
}
```
One reason to make it `async`: a scheme that resolves the DEK from a KMS at
read time (where `key_metadata` carries a wrapped DEK plus a key id, not a
plaintext DEK) needs a network round-trip during resolution. arrow-rs has a
lower-level `parquet::encryption::decrypt::KeyRetriever`, but its
`retrieve_key(&self, key_metadata) -> Result<Vec<u8>>` is synchronous, so it
cannot perform an async unwrap inline. The general path is therefore to resolve
the key (async) and hand arrow-rs an explicit-key `FileDecryptionProperties`
via `builder(dek)` (exactly what this PR already does at line 457), with the
default `StandardKeyMetadata` impl just being the synchronous no-KMS case.
Iceberg Java structures it this way: `EncryptionManager` is an interface,
with `StandardEncryptionManager` as the spec-compliant default, selected per
table and pluggable via the `encryption.kms-impl` catalog property. Worth
noting the seam there is at the manager / `decrypt` level, not only at the KMS
client, because an alternative scheme may differ in both the `key_metadata`
layout and the unwrap step, not just the KMS backend.
Is a hook like this already planned for a later PR, or intentionally
deferred? Mainly want to avoid the concrete decode path becoming hard to
retrofit once the rest of the series builds on it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]