boaz-gold commented on PR #15910:
URL: https://github.com/apache/iceberg/pull/15910#issuecomment-4275547660
I want to flag that PR #15910 no longer fixes this issue.
The original commit had the right idea — it added table.io().close() in the
RemovalListener. But after reviewer feedback about shared FileIO instances, the
close call was removed entirely. What's left in the PR is just changing
RemovalCause.EXPIRED.equals(cause) to cause.wasEvicted(), which has nothing to
do with the thread leak.
I'm hitting this in production. After 24h with ~300 Iceberg tables and a 30s
cache TTL we end up with ~3,500 live S3Client instances and ~28,000 leaked
sdk-ScheduledExecutor threads, crashing with os::commit_memory failed;
error='Not enough space' (errno=12) — thread stack exhaustion, not heap OOM.
Had to roll back to EMR 7.9 as a workaround.
On the shared FileIO concern — I think the right answer is to still call
close() and let each FileIO implementation decide what that means. S3FileIO
creates a client per table, it's never shared, so closing on eviction is always
safe. If some catalog shares a single FileIO across tables, that implementation
should make close() a no-op, not the other way around.
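
To make the proposal concrete, here is a minimal sketch of the contract I have in mind. Everything in it is a hypothetical stand-in (TableIO, PerTableIO, SharedIO, onEvicted); none of these are Iceberg or AWS SDK types, and the scheduler only mimics the threads a per-table client would own. The eviction hook always calls close(), and each implementation decides what that means:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical stand-ins for illustration only; not Iceberg or AWS SDK classes.
class EvictionCloseSketch {

  /** Minimal stand-in for a FileIO-like resource that can be closed. */
  interface TableIO extends AutoCloseable {
    @Override
    void close();
  }

  /** Owns its own scheduler, the way a per-table client owns its threads. */
  static final class PerTableIO implements TableIO {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    @Override
    public void close() {
      // Releases the executor (and any thread) this instance created.
      scheduler.shutdownNow();
    }

    boolean isClosed() {
      return scheduler.isShutdown();
    }
  }

  /** Shared across tables, so close() is deliberately a no-op. */
  static final class SharedIO implements TableIO {
    private boolean released = false;

    @Override
    public void close() {
      // No-op: the owner (e.g. the catalog) releases this instance, not the cache.
    }

    void releaseByOwner() {
      released = true;
    }

    boolean isClosed() {
      return released;
    }
  }

  /** What a cache removal listener would do on eviction: always call close(). */
  static void onEvicted(TableIO io) {
    io.close();
  }

  public static void main(String[] args) {
    PerTableIO perTable = new PerTableIO();
    SharedIO shared = new SharedIO();

    onEvicted(perTable);
    onEvicted(shared);

    System.out.println("per-table closed: " + perTable.isClosed());
    System.out.println("shared closed: " + shared.isClosed());
  }
}
```

With this split the cache needs no knowledge of whether a given instance is shared: the per-table instance releases its executor on eviction, while the shared one stays usable until its owner releases it.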
Would be happy to test any proposed fix on a live cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]