rdblue commented on code in PR #9752: URL: https://github.com/apache/iceberg/pull/9752#discussion_r1501944749
########## core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java: ########## @@ -20,39 +20,33 @@ import org.apache.iceberg.io.InputFile; import org.apache.iceberg.io.SeekableInputStream; -import org.apache.iceberg.relocated.com.google.common.base.Preconditions; public class AesGcmInputFile implements InputFile { private final InputFile sourceFile; private final byte[] dataKey; private final byte[] fileAADPrefix; private long plaintextLength; + /** + * Important: sourceFile.getLength() must return the verified plaintext content length, not the + * physical file size after encryption. This protects against tampering with the file size in + * untrusted storage systems. + */ public AesGcmInputFile(InputFile sourceFile, byte[] dataKey, byte[] fileAADPrefix) { this.sourceFile = sourceFile; this.dataKey = dataKey; this.fileAADPrefix = fileAADPrefix; - this.plaintextLength = -1; + this.plaintextLength = sourceFile.getLength(); Review Comment: Another problem I just ran into is that if we store the plaintext length, then `EncryptingFileIO` doesn't know what length to pass into `FileIO.newInputFile(String, long)` for the encrypted file length. Encrypted length depends on the encryption scheme and we don't know the details of the scheme used by some `EncryptionManager`. That makes me think that we should store the encrypted length in Iceberg metadata. That is also intuitive because the metadata length always matches the object storage length. We'll have to find out how to get the encrypted length in the data and manifest write paths. I think we should probably introduce a method to return the final size of a stream, whether that stream is encrypted or not. That would return the encrypted length for AES GCM streams. We can't change the values returned by `getPos` because those are used to produce offsets embedded in Parquet file metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org