mgmarino commented on issue #12046: URL: https://github.com/apache/iceberg/issues/12046#issuecomment-2621663034
Ok, I finally have a full explanation. When Spark frees up memory, it moves broadcast variables to disk, and this closes the I/O even if it's currently being used. This is the relevant Spark code: https://github.com/apache/spark/blob/e428fe902bb1f12cea973de7fe4b885ae69fd6ca/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1848

This is what I see in the logs:

```
2025-01-29T11:16:00.080Z 25/01/29 11:16:00 INFO BlockManager: Dropping block broadcast_30 from memory
2025-01-29T11:16:00.080Z 25/01/29 11:16:00 INFO BlockManager: Writing block broadcast_30 to disk
2025-01-29T11:16:00.080Z 25/01/29 11:16:00 INFO SerializableTableWithSize: Releasing resources
2025-01-29T11:16:00.080Z 25/01/29 11:16:00 ERROR S3FileIO: Closing S3FileIO Client
java.lang.Exception: S3FileIO: [org.apache.iceberg.aws.s3.S3FileIO@a18ec30] [software.amazon.awssdk.services.s3.DefaultS3Client@7836a391]
```

where the last entry is logging I added to track how `close` is being called. I was also tracking calls to e.g. `newInputFile`, and can see it being called on the same instance (`S3FileIO@a18ec30`) after `close` has been called:

```
25/01/29 11:16:02 ERROR S3FileIO: Getting Input File
java.lang.Exception: S3FileIO: [org.apache.iceberg.aws.s3.S3FileIO@a18ec30] is Closed: true
	at org.apache.iceberg.aws.s3.S3FileIO.newInputFile(S3FileIO.java:143) ~[680fb4e.jar:?]
```

I got that output by adding:

```
@Override
public InputFile newInputFile(String path) {
  // Debug-only: the Exception is created just to capture the caller's stack trace.
  LOG.error(
      "Getting Input File",
      new Exception(
          "S3FileIO: [" + this + "] is Closed: " + (isResourceClosed.get() ? "true" : "false")));
  return S3InputFile.fromLocation(path, client(), s3FileIOProperties, metrics);
}
```

To summarize: unless it's possible to guarantee that the serializable table is *not* removed from memory and persisted to disk, it's not possible to close the IO safely.
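To make the sequence above concrete, here is a minimal, self-contained sketch of the hazard in plain Java. It does not use Spark or Iceberg. `FakeFileIO`, `FakeBroadcastTable`, `releaseResources` and `readAcrossEviction` are hypothetical names invented for illustration, and the sketch assumes (as the logs suggest) that dropping the in-memory copy of the broadcast fires a release hook that closes the shared IO. The stand-in IO throws on use after close purely to make the problem visible; the real `S3FileIO` may behave differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class BroadcastCloseSketch {

  /** Hypothetical stand-in for a FileIO that owns a closeable client. */
  static final class FakeFileIO implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    String newInputFile(String path) {
      // Assumption for illustration: fail loudly on use after close.
      if (closed.get()) {
        throw new IllegalStateException("FileIO is closed: " + path);
      }
      return "input:" + path;
    }

    @Override
    public void close() {
      closed.set(true);
    }

    boolean isClosed() {
      return closed.get();
    }
  }

  /** Hypothetical stand-in for a broadcast table value holding one shared IO. */
  static final class FakeBroadcastTable {
    private final FakeFileIO io = new FakeFileIO();

    FakeFileIO io() {
      return io;
    }

    // Mimics a release hook that runs when the in-memory copy is dropped.
    void releaseResources() {
      io.close();
    }
  }

  /** Returns what a running task observes when eviction happens between two reads. */
  static String readAcrossEviction(FakeBroadcastTable table) {
    FakeFileIO io = table.io(); // the task obtains the IO and keeps using it
    String first = io.newInputFile("s3://bucket/a.parquet");
    table.releaseResources(); // memory pressure: block dropped, hook fires
    try {
      return first + "," + io.newInputFile("s3://bucket/b.parquet");
    } catch (IllegalStateException e) {
      return first + ",FAILED(" + e.getMessage() + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(readAcrossEviction(new FakeBroadcastTable()));
  }
}
```

Running `main` prints `input:s3://bucket/a.parquet,FAILED(FileIO is closed: s3://bucket/b.parquet)`: the first read succeeds, the eviction closes the IO, and the second read on the same instance fails even though the task never asked for it to be closed.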