stevenzwu commented on code in PR #10926:
URL: https://github.com/apache/iceberg/pull/10926#discussion_r1736772930


##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -63,7 +63,11 @@ public class HadoopFileIO implements HadoopConfigurable, DelegateFileIO {
    * <p>{@link Configuration Hadoop configuration} must be set through {@link
    * HadoopFileIO#setConf(Configuration)}
    */
-  public HadoopFileIO() {}
+  public HadoopFileIO() {
+    // Create a default hadoopConf as it is required for the object to be valid.
+    // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
+    this.hadoopConf = new SerializableConfiguration(new Configuration())::get;

Review Comment:
   > I have seen that we use new Configuration(false) in the code, so we allow 
for the user to provide a trimmed configuration, and in this case the 
serialized config is quite small for the binary serialization. We might have to 
do something similar for the JSON serialization.
   
   Are you suggesting that we trim the key-value pairs that match the entries
of the default configuration, as with `new Configuration(false)`? We could
potentially do that, but the result would still depend on the runtime
environment: if the other side (deserialization) has a different environment
(default config), the restored configuration can differ.
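   
   To illustrate the concern, here is a minimal sketch that uses plain maps as
stand-ins for Hadoop `Configuration` objects. The class and method names
(`ConfTrimSketch`, `trim`, `restore`) are hypothetical and are not part of
Iceberg or Hadoop; values are assumed to be non-null.
   
   ```java
   import java.util.Map;
   import java.util.TreeMap;

   // Hypothetical illustration only: plain maps stand in for Hadoop Configuration.
   public class ConfTrimSketch {

     // Writer side: keep only entries that are absent from, or differ from, the defaults.
     static Map<String, String> trim(Map<String, String> conf, Map<String, String> defaults) {
       Map<String, String> trimmed = new TreeMap<>();
       for (Map.Entry<String, String> entry : conf.entrySet()) {
         if (!entry.getValue().equals(defaults.get(entry.getKey()))) {
           trimmed.put(entry.getKey(), entry.getValue());
         }
       }
       return trimmed;
     }

     // Reader side: overlay the trimmed entries on the reader's own defaults.
     static Map<String, String> restore(
         Map<String, String> trimmed, Map<String, String> readerDefaults) {
       Map<String, String> restored = new TreeMap<>(readerDefaults);
       restored.putAll(trimmed);
       return restored;
     }
   }
   ```
   
   If the reader's defaults differ from the writer's for a key that was trimmed
away, `restore` silently picks up the reader's value, so the round trip is only
lossless when both sides share the same defaults.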
   
   > The ManifestListReadTask.rows() and ManifestListReadTask.file() is using 
the io to get the new input file like io.newInputFile(manifestListLocation). 
   
   `FileIO` is also used to read manifest files. `ManifestFiles.read` is a
widely used API.
   ```
       private CloseableIterable<? extends ContentFile<?>> files(Schema fileProjection) {
         switch (manifest.content()) {
           case DATA:
             return ManifestFiles.read(manifest, io, specsById).project(fileProjection);
           case DELETES:
             return ManifestFiles.readDeleteManifest(manifest, io, specsById).project(fileProjection);
   ...
       }
   ```
   
   >  FileIO serialization will become even more complicated when the manifest 
file encryption arrives here. We will need to apply the encryption for the 
FileIO 
   
   Looking at the usage of `EncryptingFileIO`, it is not part of the table ops
state, so it should never need to be serialized. Only the original `FileIO` and
the `EncryptionManager` need to be serialized.
   
   ```
     protected EncryptedOutputFile newManifestOutputFile() {
       String manifestFileLocation =
           ops.metadataFileLocation(
               FileFormat.AVRO.addExtension(commitUUID + "-m" + manifestCount.getAndIncrement()));
       return EncryptingFileIO.combine(ops.io(), ops.encryption())
           .newEncryptingOutputFile(manifestFileLocation);
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org