ggershinsky commented on code in PR #9752: URL: https://github.com/apache/iceberg/pull/9752#discussion_r1497429004
########## core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java: ########## @@ -20,39 +20,33 @@ import org.apache.iceberg.io.InputFile; import org.apache.iceberg.io.SeekableInputStream; -import org.apache.iceberg.relocated.com.google.common.base.Preconditions; public class AesGcmInputFile implements InputFile { private final InputFile sourceFile; private final byte[] dataKey; private final byte[] fileAADPrefix; private long plaintextLength; + /** + * Important: sourceFile.getLength() must return the verified plaintext content length, not the + * physical file size after encryption. This protects against tampering with the file size in + * untrusted storage systems. + */ public AesGcmInputFile(InputFile sourceFile, byte[] dataKey, byte[] fileAADPrefix) { this.sourceFile = sourceFile; this.dataKey = dataKey; this.fileAADPrefix = fileAADPrefix; - this.plaintextLength = -1; + this.plaintextLength = sourceFile.getLength(); Review Comment: A couple of observations 1. AesGcmInputFile objects are produced by the `EncryptionManager.decrypt(EncryptedInputFile encrypted)` function. This function is only given an EncryptedInputFile object, comprised of the source InputFile and its KeyMetadata. So the only way for the decrypt function to get the file length information is to call the source `InputFile.getLength()`. We can't provide the decrypt function with the file length via additional channels. 2. But thats not a problem in the current Iceberg code - because the source InputFile length is fully controlled, and comes from a trusted place (not from the underlying object store). More specifically, the sequence is: a) **writing**: _public ManifestFile ManifestWriter.toManifestFile() return new GenericManifestFile(file.location(),_ [writer.length()](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L194), -> _public long org.apache.iceberg.avro.AvroFileAppender.length()_ [ return stream.getPos();](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L83) So the length, recorded in the ManifestFile objects for each encrypted manifest, is the plaintext length (and not the actual file length), since the `AesGcmOutputStream.getPos()` returns the plaintext position. b) **reading**: _InputFile ManifestFiles.newInputFile(FileIO io, **ManifestFile manifest**) return InputFile input =_ [io.newInputFile(manifest)](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFiles.java#L349); -> _public InputFile EncryptingFileIO.newInputFile(ManifestFile manifest) return newDecryptingInputFile(manifest.path(), **manifest.length()**, manifest.keyMetadata());_ -> _public InputFile EncryptingFileIO.newDecryptingInputFile(String path, long plainLength, ByteBuffer buffer) { InputFile inputFile = **io.newInputFile(path, plainLength);** return em.decrypt(wrap(inputFile, buffer));_ So that the length in the `InputFile sourceFile`, given to the AesGcmInputFile constructor, comes from a safe ManifestFile object (since manifest lists are encrypted/signed), and not from an unsafe storage. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org