ggershinsky commented on code in PR #9752:
URL: https://github.com/apache/iceberg/pull/9752#discussion_r1497429004


##########
core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java:
##########
@@ -20,39 +20,33 @@
 
 import org.apache.iceberg.io.InputFile;
 import org.apache.iceberg.io.SeekableInputStream;
-import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 
 public class AesGcmInputFile implements InputFile {
   private final InputFile sourceFile;
   private final byte[] dataKey;
   private final byte[] fileAADPrefix;
   private long plaintextLength;
 
+  /**
+   * Important: sourceFile.getLength() must return the verified plaintext 
content length, not the
+   * physical file size after encryption. This protects against tampering with 
the file size in
+   * untrusted storage systems.
+   */
   public AesGcmInputFile(InputFile sourceFile, byte[] dataKey, byte[] 
fileAADPrefix) {
     this.sourceFile = sourceFile;
     this.dataKey = dataKey;
     this.fileAADPrefix = fileAADPrefix;
-    this.plaintextLength = -1;
+    this.plaintextLength = sourceFile.getLength();

Review Comment:
   A couple of observations
   
   1. AesGcmInputFile objects are produced by the 
`EncryptionManager.decrypt(EncryptedInputFile encrypted)` function. This 
function is only given an EncryptedInputFile object, comprised of the source 
InputFile and its KeyMetadata. So the only way for the decrypt function to get 
the file length information is to call the source `InputFile.getLength()`. We 
can't provide the decrypt function with the file length via additional channels.
   
   2. But thats not a problem in the current Iceberg code - because the source 
InputFile length is fully controlled, and comes from a trusted place (not from 
the underlying object store). More specifically, the sequence is:
   
   a) **writing**:
   
   _public ManifestFile ManifestWriter.toManifestFile()
       return new GenericManifestFile(file.location(),_   
[writer.length()](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L194),
   ->
   _public long org.apache.iceberg.avro.AvroFileAppender.length()_
          [ return 
stream.getPos();](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L83)
   
   So the length, recorded in the ManifestFile objects for each encrypted 
manifest, is the plaintext length (and not the actual file length), since the 
`AesGcmOutputStream.getPos()` returns the plaintext position.
   
   b) **reading**: 
   
   _InputFile ManifestFiles.newInputFile(FileIO io, **ManifestFile manifest**) 
       return InputFile input =_ 
[io.newInputFile(manifest)](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestFiles.java#L349);
   ->
   _public InputFile EncryptingFileIO.newInputFile(ManifestFile manifest) 
       return newDecryptingInputFile(manifest.path(), **manifest.length()**, 
manifest.keyMetadata());_
   ->
   _public InputFile EncryptingFileIO.newDecryptingInputFile(String path, long 
plainLength, ByteBuffer buffer) {
       InputFile inputFile = **io.newInputFile(path, plainLength);**
       return em.decrypt(wrap(inputFile, buffer));_
   
   So that the length in the `InputFile sourceFile`, given to the 
AesGcmInputFile constructor, comes from a safe ManifestFile object (since 
manifest lists are encrypted/signed), and not from an unsafe storage.
   
   What do you think?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to