mgmarino commented on issue #12046: URL: https://github.com/apache/iceberg/issues/12046#issuecomment-2612986973
After doing some further investigation, my initial conclusion is the following:

- I can see `SerializableTableWithSize` being generated on the driver in at least two different places:
  - `org.apache.iceberg.spark.source.SparkWrite.createWriterFactory`: https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L190
  - `org.apache.iceberg.spark.source.SparkBatch.planInputPartitions`: https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L78

  In both cases the tables point to the same `FileIO` object (in this case `S3FileIO`).
- If these jobs are submitted to the same executor, the tables will still point to the *same* IO object after deserialization, meaning that when one gets cleaned up (and closed), it will affect the other.

I am not sure what a good solution is here, but I suspect that the `FileIO` may need to be copied when creating the serializable table, instead of what is done now: https://github.com/apache/iceberg/blob/6e2bc9ac4ef9ca9afeff66814de6567ae63da9da/core/src/main/java/org/apache/iceberg/SerializableTable.java#L123

Would love to get some input here!
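To make the hazard concrete, here is a minimal toy sketch (hypothetical `ToyIO`/`ToyTable` classes, not Iceberg's actual types) of the lifecycle problem described above: when two table instances on one executor end up holding a reference to the same closeable IO object, closing one table's IO also closes the other's, whereas giving each table its own copy keeps their lifecycles independent.

```java
import java.io.Closeable;
import java.io.Serializable;

// Stand-in for a FileIO-like resource that can be closed once.
class ToyIO implements Closeable, Serializable {
    private boolean closed = false;
    boolean isClosed() { return closed; }
    @Override public void close() { closed = true; }
}

// Stand-in for a serializable table that carries an IO reference.
class ToyTable implements Serializable {
    final ToyIO io;
    ToyTable(ToyIO io) { this.io = io; }
}

public class SharedIoDemo {
    public static void main(String[] args) {
        // Both tables share one IO object, analogous to the write-side and
        // read-side tables pointing at the same S3FileIO on an executor.
        ToyIO shared = new ToyIO();
        ToyTable writeTable = new ToyTable(shared);
        ToyTable readTable = new ToyTable(shared);

        // One job finishes and cleans up its table's IO...
        writeTable.io.close();

        // ...and the other table's IO is now closed too.
        System.out.println(readTable.io.isClosed()); // true

        // If each serializable table instead received its own copy of the IO
        // (the possible fix suggested above), the lifecycles decouple:
        ToyTable independent = new ToyTable(new ToyIO());
        System.out.println(independent.io.isClosed()); // false
    }
}
```

This is only a model of the symptom; in Iceberg the sharing arises from how the table and its `FileIO` are serialized to executors, so an actual fix would need to copy (or re-initialize) the `FileIO` at the point linked above.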