boaz-gold commented on issue #15898: URL: https://github.com/apache/iceberg/issues/15898#issuecomment-4275550650
I want to flag that PR https://github.com/apache/iceberg/pull/15910 no longer fixes this issue. The original commit had the right idea: it added table.io().close() in the RemovalListener. But after reviewer feedback about shared FileIO instances, the close call was removed entirely. What's left in the PR is just changing RemovalCause.EXPIRED.equals(cause) to cause.wasEvicted(), which has nothing to do with the thread leak.

I'm hitting this in production. After 24h with ~300 Iceberg tables and a 30s cache TTL, we end up with ~3,500 live S3Client instances and ~28,000 leaked sdk-ScheduledExecutor threads, and the JVM crashes with os::commit_memory failed; error='Not enough space' (errno=12). That is thread stack exhaustion, not heap OOM. I had to roll back to EMR 7.9 as a workaround.

On the shared FileIO concern: I think the right answer is to still call close() and let each FileIO implementation decide what that means. S3FileIO creates a client per table and never shares it, so closing on eviction is always safe. If some catalog shares a single FileIO across tables, that implementation should make close() a no-op, not the other way around.

I would be happy to test any proposed fix on a live cluster.
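
To make the ownership argument concrete, here is a minimal sketch of "always call close() on eviction and let the implementation decide". It uses simplified stand-in types only: ClosableIO, PerTableIO, SharedIO, and TableIOCache are hypothetical names invented for illustration. This is not Iceberg's FileIO interface and not Caffeine's RemovalListener, just a model of the rule proposed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical stand-in for a FileIO-like resource (not Iceberg's interface).
interface ClosableIO extends AutoCloseable {
    @Override
    void close();
}

// Models an IO created per table that owns its own scheduler,
// roughly like a client that starts background executor threads.
final class PerTableIO implements ClosableIO {
    private final ScheduledExecutorService scheduler =
        Executors.newScheduledThreadPool(1);

    @Override
    public void close() {
        // Releasing the owned executor is what prevents the thread leak.
        scheduler.shutdownNow();
    }

    boolean isClosed() {
        return scheduler.isShutdown();
    }
}

// Models an IO shared across tables: close() deliberately releases nothing,
// so evicting one table cannot break the others.
final class SharedIO implements ClosableIO {
    int closeCalls = 0;

    @Override
    public void close() {
        closeCalls++; // counted only so the sketch can be checked
    }
}

// Tiny cache that unconditionally closes an entry's IO when it is evicted.
// evict() plays the role a removal listener would play in a real cache.
final class TableIOCache {
    private final Map<String, ClosableIO> entries = new LinkedHashMap<>();

    void put(String table, ClosableIO io) {
        entries.put(table, io);
    }

    void evict(String table) {
        ClosableIO io = entries.remove(table);
        if (io != null) {
            io.close(); // the cache does not need to know whether io is shared
        }
    }

    int size() {
        return entries.size();
    }
}

final class EvictionCloseDemo {
    public static void main(String[] args) {
        PerTableIO owned = new PerTableIO();
        SharedIO shared = new SharedIO();
        TableIOCache cache = new TableIOCache();

        cache.put("db.t1", owned);
        cache.put("db.t2", shared);
        cache.put("db.t3", shared);

        cache.evict("db.t1"); // shuts down the per-table scheduler
        cache.evict("db.t2"); // no-op for the shared instance

        System.out.println("owned closed: " + owned.isClosed());
        System.out.println("entries left: " + cache.size());
    }
}
```

The point of the design is that the cache stays ignorant of sharing: the per-table implementation frees its threads, and the shared implementation stays usable for db.t3 after db.t2 is evicted.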
