mgmarino commented on issue #12046:
URL: https://github.com/apache/iceberg/issues/12046#issuecomment-2612986973

   After doing some further investigation, my initial conclusion is the following:
   
   - I can see `SerializableTableWithSize` being created on the driver in at least two different places:
       - `org.apache.iceberg.spark.source.SparkWrite.createWriterFactory`: 
   https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L190
       - `org.apache.iceberg.spark.source.SparkBatch.planInputPartitions`: 
   https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L78
   
     Both tables point to the same `FileIO` object (in this case `S3FileIO`).
   - If these jobs are submitted to the same executor, the deserialized tables will still point to the *same* `FileIO` instance, so when one of them is cleaned up (and its IO closed), the other is affected as well.
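   To make the failure mode concrete, here is a minimal, self-contained sketch. `DemoIO` and `DemoTable` are hypothetical stand-ins, not the real Iceberg classes, and a single `ObjectOutputStream` is only one way such instance sharing can arise (in the real case it may come through executor-side caching); the point is just that two wrappers sharing one closeable IO can still share it after deserialization, so closing one breaks the other:

   ```java
   import java.io.*;

   // Hypothetical stand-in for a closeable FileIO; not the Iceberg class.
   class DemoIO implements Serializable, Closeable {
       boolean closed = false;
       @Override public void close() { closed = true; }
   }

   // Hypothetical stand-in for SerializableTableWithSize.
   class DemoTable implements Serializable {
       final DemoIO io;
       DemoTable(DemoIO io) { this.io = io; }
   }

   public class SharedIoDemo {
       static boolean[] run() throws Exception {
           DemoIO driverIo = new DemoIO();
           DemoTable writeTable = new DemoTable(driverIo); // e.g. from the write path
           DemoTable readTable = new DemoTable(driverIo);  // e.g. from the read path

           // Serialize both tables into one stream, as can happen when they
           // travel together to the same executor.
           ByteArrayOutputStream buf = new ByteArrayOutputStream();
           try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
               out.writeObject(writeTable);
               out.writeObject(readTable);
           }
           try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
               DemoTable w = (DemoTable) in.readObject();
               DemoTable r = (DemoTable) in.readObject();
               boolean sameInstance = (w.io == r.io); // identity preserved within a stream
               w.io.close();                          // one table is cleaned up...
               return new boolean[] {sameInstance, r.io.closed}; // ...the other's IO closes too
           }
       }

       public static void main(String[] args) throws Exception {
           boolean[] result = run();
           System.out.println("same IO instance after deserialization: " + result[0]);
           System.out.println("other table's IO closed too: " + result[1]);
       }
   }
   ```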
   
   I am not sure what a good solution is here, but I suspect the `FileIO` may need to be copied when creating the serializable table, instead of reusing the table's instance as is done now:
   
   
https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/core/src/main/java/org/apache/iceberg/SerializableTable.java#L123
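   As a rough illustration of the copy idea, here is a hedged sketch using hypothetical stand-ins (`CopyableIO`, `CopyingTable`); as far as I know the real `FileIO` interface has no generic copy method, so rebuilding from properties is only an assumption. Giving each serializable table its own IO instance at creation time decouples their lifecycles:

   ```java
   import java.io.*;
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical stand-in for a FileIO that can be rebuilt from its properties.
   class CopyableIO implements Serializable, Closeable {
       final Map<String, String> properties;
       boolean closed = false;

       CopyableIO(Map<String, String> properties) { this.properties = properties; }

       // Build a fresh, independently closeable instance from the same config.
       CopyableIO copy() { return new CopyableIO(new HashMap<>(properties)); }

       @Override public void close() { closed = true; }
   }

   // Hypothetical serializable table that takes its own IO copy at creation
   // time, instead of holding the driver's shared instance.
   class CopyingTable implements Serializable {
       final CopyableIO io;
       CopyingTable(CopyableIO sharedIo) { this.io = sharedIo.copy(); }
   }

   public class CopiedIoDemo {
       static boolean[] run() {
           CopyableIO driverIo = new CopyableIO(Map.of("s3.endpoint", "example"));
           CopyingTable writeTable = new CopyingTable(driverIo);
           CopyingTable readTable = new CopyingTable(driverIo);

           writeTable.io.close(); // cleaning up one table...
           return new boolean[] {
               writeTable.io != readTable.io, // ...does not touch the other's IO
               !readTable.io.closed
           };
       }

       public static void main(String[] args) {
           boolean[] result = run();
           System.out.println("tables hold distinct IO instances: " + result[0]);
           System.out.println("other table's IO still open: " + result[1]);
       }
   }
   ```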
   
   Would love to get some input here!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

