stevenzwu commented on code in PR #10926:
URL: https://github.com/apache/iceberg/pull/10926#discussion_r1734784202


##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -63,7 +63,11 @@ public class HadoopFileIO implements HadoopConfigurable, DelegateFileIO {
    * <p>{@link Configuration Hadoop configuration} must be set through {@link
    * HadoopFileIO#setConf(Configuration)}
    */
-  public HadoopFileIO() {}
+  public HadoopFileIO() {
+    // Create a default hadoopConf as it is required for the object to be valid.
+    // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
+    this.hadoopConf = new SerializableConfiguration(new Configuration())::get;

Review Comment:
   > There could potentially be hundreds, sometimes thousands of config keys in a Configuration object (I have seen above 2 thousand in some cases).
   
   Agreed that the size is non-trivial, but note that `FileIO` is not serialized per file scan task; it was [serialized per manifest file scan task](https://github.com/apache/iceberg/pull/10735/files#diff-8bc8bee8f0eb396c2e6bf7175fca3207fad4daead009d8cc25aa2360f8544c55). So the impact is not too bad.
   
   > 1. Use empty config - if the config is really used this will cause issues
   
   This is what the current PR is doing, but it is not correct: the Hadoop configuration is part of the `HadoopFileIO` object state, and `FileIOParser` carries it over during serialization.
   
   > 2. Serialize only the non-default values - version change could cause issues
   
   Is this doable? As you also said, it can be problematic: object deserialization would then depend on the environment, which is also technically incorrect.
   
   > 3. Use the one provided by the catalog - task/job specific configs can cause issues
   
   `FileIOParser` has no access to a catalog. Plus, it could also be a `RESTCatalog` with `HadoopFileIO`.
   
   > This way we can serialize Hadoop's Configuration into a Map<String, String> and deserialize it into Parquet's PlainParquetConfiguration equivalent, which is probably much more lightweight.
   
   @Fokko I want to point out that this change/fix is for the iceberg-core module, not just for Flink. Let me make sure that I understand you correctly.
   
   `FileIOParser` can serialize the configuration as string key-value pairs; that is in line with what I was saying earlier about serializing only string key-value pairs.
   
   But we want to avoid using the Hadoop `Configuration` class as the setter type. We would probably need to extend Iceberg's `HadoopConfigurable` interface with a new setter, like `setConfigProperties(Map<String, String> configProperties)`. Then why don't we merge the Hadoop configuration with the existing `Map<String, String>` properties in `HadoopFileIO`? That was the [alternative](https://github.com/apache/iceberg/pull/10926#discussion_r1733692980) approach I mentioned in this comment.
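   To make the merge idea concrete, here is a rough sketch in plain Java. `mergeConfig` is a hypothetical helper, not an existing Iceberg API; it assumes the Hadoop configuration has already been flattened to string key-value pairs, and lets explicit `FileIO` properties win on key conflicts:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Hypothetical sketch only; mergeConfig is not an existing Iceberg API.
   // It folds Hadoop configuration entries (already flattened to string
   // key-value pairs) into the FileIO's existing properties map.
   public class ConfigMergeSketch {
   
     static Map<String, String> mergeConfig(
         Map<String, String> fileIoProperties, Map<String, String> hadoopConf) {
       Map<String, String> merged = new HashMap<>(hadoopConf);
       merged.putAll(fileIoProperties); // FileIO properties take precedence
       return merged;
     }
   
     public static void main(String[] args) {
       Map<String, String> props = Map.of("shared.key", "from-fileio");
       Map<String, String> conf =
           Map.of("fs.defaultFS", "hdfs://namenode:8020", "shared.key", "from-hadoop");
   
       Map<String, String> merged = mergeConfig(props, conf);
       System.out.println(merged.get("fs.defaultFS")); // hdfs://namenode:8020
       System.out.println(merged.get("shared.key"));   // from-fileio
     }
   }
   ```
   
   The merged map is plain strings, so `FileIOParser` could round-trip it without touching the Hadoop `Configuration` class at all.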
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

