pvary commented on code in PR #10926:
URL: https://github.com/apache/iceberg/pull/10926#discussion_r1734538084


##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -63,7 +63,11 @@ public class HadoopFileIO implements HadoopConfigurable, DelegateFileIO {
    * <p>{@link Configuration Hadoop configuration} must be set through {@link
    * HadoopFileIO#setConf(Configuration)}
    */
-  public HadoopFileIO() {}
+  public HadoopFileIO() {
+    // Create a default hadoopConf as it is required for the object to be valid.
+    // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
+    this.hadoopConf = new SerializableConfiguration(new Configuration())::get;

Review Comment:
   I'm very concerned about the approach of serializing/deserializing the Hadoop Configuration. A Configuration object can hold hundreds, sometimes thousands, of config keys (I have seen more than 2,000 in some cases). When we start writing out string key-value pairs, we even lose the gzip compression we currently get from the Configuration.write method. This will seriously increase the size of the serialized object.
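To make the size concern concrete, here is a rough sketch that compares an uncompressed and a gzip-compressed encoding of a synthetic 2,000-entry key-value set. It uses only the JDK and does not reproduce Hadoop's actual wire format; the class name `ConfigSizeSketch`, the key names, and the values are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

final class ConfigSizeSketch {
  /** Builds a synthetic config; the key and value shapes are hypothetical. */
  static Map<String, String> syntheticConfig(int entries) {
    Map<String, String> conf = new LinkedHashMap<>();
    for (int i = 0; i < entries; i++) {
      conf.put("fs.example.option." + i, "value-" + i);
    }
    return conf;
  }

  /** Writes every pair as two length-prefixed UTF strings, uncompressed. */
  static byte[] writePlain(Map<String, String> conf) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      writePairs(conf, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Writes the same pairs through a gzip stream. */
  static byte[] writeGzipped(Map<String, String> conf) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // Closing the outer stream finishes the gzip trailer before we read the bytes.
    try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
      writePairs(conf, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  private static void writePairs(Map<String, String> conf, DataOutputStream out)
      throws IOException {
    out.writeInt(conf.size());
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      out.writeUTF(entry.getKey());
      out.writeUTF(entry.getValue());
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = syntheticConfig(2000);
    System.out.println("plain bytes:   " + writePlain(conf).length);
    System.out.println("gzipped bytes: " + writeGzipped(conf).length);
  }
}
```

Real configurations have less regular keys and values than this synthetic set, so the actual ratio would differ; the sketch only shows the direction of the effect.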
   
   I think we should consider some ways to reduce the size of the objects:
   1. Use an empty config - if the config is actually used, this will cause issues
   2. Serialize only the non-default values - a version change could cause issues
   3. Use the one provided by the catalog - task/job-specific configs can cause issues
   
   I'm leaning towards the 2nd option
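
A minimal sketch of what the 2nd option could look like, using plain `java.util.Map` instances in place of Hadoop `Configuration` objects. The class `NonDefaultConfigDiff` and its methods are hypothetical and not part of Iceberg or Hadoop. The sketch assumes the receiver can load its own defaults, and it does not handle keys that were unset relative to the defaults.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class NonDefaultConfigDiff {
  /** Returns only the entries of {@code current} whose value differs from {@code defaults}. */
  static Map<String, String> nonDefaults(
      Map<String, String> defaults, Map<String, String> current) {
    Map<String, String> diff = new HashMap<>();
    for (Map.Entry<String, String> entry : current.entrySet()) {
      if (!Objects.equals(defaults.get(entry.getKey()), entry.getValue())) {
        diff.put(entry.getKey(), entry.getValue());
      }
    }
    return diff;
  }

  /** Rebuilds a full config on the receiving side from its own defaults plus the diff. */
  static Map<String, String> restore(
      Map<String, String> receiverDefaults, Map<String, String> diff) {
    Map<String, String> restored = new HashMap<>(receiverDefaults);
    restored.putAll(diff);
    return restored;
  }

  public static void main(String[] args) {
    Map<String, String> defaults = new HashMap<>();
    defaults.put("io.buffer.size", "4096");
    defaults.put("fs.example.retries", "3");

    Map<String, String> current = new HashMap<>(defaults);
    current.put("fs.example.retries", "10"); // overridden by the job
    current.put("fs.example.endpoint", "host"); // not present in the defaults

    // Only the two changed or added entries need to be shipped.
    Map<String, String> diff = nonDefaults(defaults, current);
    System.out.println("shipped entries: " + diff.size());

    // Same defaults on both sides: the original config is recovered.
    System.out.println("round trip ok: " + restore(defaults, diff).equals(current));

    // Different defaults on the receiver (e.g. another version): silent drift.
    Map<String, String> newerDefaults = new HashMap<>(defaults);
    newerDefaults.put("io.buffer.size", "8192");
    System.out.println(
        "round trip with drift ok: " + restore(newerDefaults, diff).equals(current));
  }
}
```

The last case illustrates the version-change caveat: any key the sender left at its default takes the receiver's default after `restore`, so differing defaults lead to a different effective config without any error.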



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

