rdblue commented on code in PR #5432: URL: https://github.com/apache/iceberg/pull/5432#discussion_r1015857790
##########
format/gcm-stream-spec.md:
##########
@@ -0,0 +1,87 @@
+---
+title: "AES GCM Stream Spec"
+url: gcm-stream-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# AES GCM Stream (AGS) file format extension
+
+## Background and Motivation
+
+Iceberg supports a number of data file formats. Two of these formats (Parquet and ORC) have built-in encryption capabilities that can protect sensitive information in the data files. However, besides the data files, Iceberg tables also have metadata files, which contain sensitive information too (e.g., min/max values in manifest files, or bloom filter bitsets in Puffin files). The metadata file formats (Avro, JSON, Puffin) don't have encryption support.
+
+Moreover, with the exception of Parquet, no Iceberg data or metadata file format supports integrity verification, which is required for end-to-end tamper proofing of Iceberg tables.
+
+This document specifies the details of a simple file format extension that adds encryption and tamper proofing to any existing file format.
+
+## Goals
+
+* Metadata encryption: enable encryption of manifests, manifest lists, snapshots and stats.
+* Avro data encryption: enable encryption of data files in tables that use the Avro format.
+* Tamper proofing of Iceberg data and metadata files.
+
+## Overview
+
+The output stream, produced by a metadata or data writer, is split into equal-size blocks (plus residue). Each block is enciphered (encrypted/signed) with a given encryption key, and stored in a file in the AGS format. Upon reading, the stored cipher blocks are verified for integrity, then decrypted and passed to metadata or data readers.
+
+## Encryption algorithm
+
+AGS uses the standard AES GCM cipher, and supports all AES key sizes: 128, 192 and 256 bits.
+
+AES GCM is an authenticated encryption algorithm. Besides data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data (default), and of the data combined with an optional AAD (“additional authenticated data”). An AAD is free text that is authenticated together with the data. The structure of AGS AADs is described below.
+
+AES GCM requires a unique vector to be provided for each encrypted block. In this document, the unique input to GCM encryption is called a nonce (“number used once”). AGS encryption uses the RBG-based (random bit generator) nonce construction as defined in section 8.2.2 of NIST SP 800-38D. For each encrypted block, AGS generates a unique nonce with a length of 12 bytes (96 bits).
+
+## Format specification
+
+### File structure
+
+AGS-encrypted files have the following structure:
+
+```
+Magic BlockLength CipherBlock₁ CipherBlock₂ ... CipherBlockₙ
+```
+
+where
+
+- `Magic` is four bytes 0x41, 0x47, 0x53, 0x31 ("AGS1", short for: AES GCM Stream, version 1)
+- `BlockLength` is a four-byte little-endian integer holding the length of the equal-size split blocks before encryption. The length is specified in bytes.
+- `CipherBlockᵢ` is the i-th enciphered block in the file, with the structure defined below.
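
The header layout and the block split described above can be sketched as follows. This is only an illustration under stated assumptions, not the PR's implementation: the function names `write_header`, `read_header` and `split_blocks` are hypothetical, and `BlockLength` is treated as an unsigned integer, which the text does not state. Encryption itself is left out.

```python
import io
import struct

MAGIC = b"AGS1"  # bytes 0x41 0x47 0x53 0x31, as listed in the spec text


def write_header(out, block_length):
    """Write Magic followed by BlockLength as a 4-byte little-endian integer."""
    # Assumption: the length is unsigned ("<I"); the spec text does not say.
    out.write(MAGIC)
    out.write(struct.pack("<I", block_length))


def read_header(inp):
    """Read and validate the 8-byte header; return the plaintext block length."""
    header = inp.read(8)
    if len(header) != 8 or header[:4] != MAGIC:
        raise ValueError("not an AGS1 stream")
    (block_length,) = struct.unpack("<I", header[4:])
    return block_length


def split_blocks(plaintext, block_length):
    """Split plaintext into equal-size blocks plus a shorter final residue."""
    return [
        plaintext[i:i + block_length]
        for i in range(0, len(plaintext), block_length)
    ]


# Round-trip the header through an in-memory stream.
buf = io.BytesIO()
write_header(buf, 1024)
buf.seek(0)
print(read_header(buf))               # 1024
print(split_blocks(b"abcdefghij", 4))  # [b'abcd', b'efgh', b'ij']
```

Each block returned by `split_blocks` would then be enciphered separately with its own 12-byte nonce and written after the header as one cipher block.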
+
+### Cipher Block structure
+
+Cipher blocks have the following structure
+
+| nonce | ciphertext | tag |
+|-------|------------|-----|

Review Comment:
I think that this should work.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org