boaz-gold commented on issue #15898:
URL: https://github.com/apache/iceberg/issues/15898#issuecomment-4275550650

   I want to flag that PR https://github.com/apache/iceberg/pull/15910 no 
longer fixes this issue.
   
   The original commit had the right idea — it added table.io().close() in the 
RemovalListener. But after reviewer feedback about shared FileIO instances, the 
close call was removed entirely. What's left in the PR is just changing 
RemovalCause.EXPIRED.equals(cause) to cause.wasEvicted(), which has nothing to 
do with the thread leak.
   
   I'm hitting this in production. After 24h with ~300 Iceberg tables and a 30s 
cache TTL we end up with ~3,500 live S3Client instances and ~28,000 leaked 
sdk-ScheduledExecutor threads, crashing with os::commit_memory failed; 
error='Not enough space' (errno=12) — thread stack exhaustion, not heap OOM.
   
   Had to roll back to EMR 7.9 as a workaround.
   
   On the shared FileIO concern: I think the right answer is to still call 
close() and let each FileIO implementation decide what that means. S3FileIO 
creates a client per table and never shares it, so closing on eviction is 
always safe. If some catalog shares a single FileIO across tables, that 
implementation should make close() a no-op, not the other way around.
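   
   To make the proposed contract concrete, here is a minimal, dependency-free 
sketch. Every name in it (PerTableIO, SharedIO, onRemoval, 
liveClientsAfterEviction) is a hypothetical stand-in, not Iceberg's or 
Caffeine's actual API; the wasEvicted flag only mimics what 
RemovalCause.wasEvicted() would report. It assumes the per-table IO owns its 
client and the shared IO opts out by making close() a no-op.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CloseOnEvictionSketch {

  // Hypothetical stand-in for a per-table FileIO that owns a client and its threads.
  static final class PerTableIO implements AutoCloseable {
    private final AtomicInteger liveClients;
    private boolean closed;

    PerTableIO(AtomicInteger liveClients) {
      this.liveClients = liveClients;
      liveClients.incrementAndGet(); // models "client created"
    }

    @Override
    public void close() {
      if (!closed) { // idempotent: a second close() changes nothing
        closed = true;
        liveClients.decrementAndGet(); // models "client and threads released"
      }
    }
  }

  // Hypothetical FileIO shared across tables: it opts out by making close() a no-op.
  static final class SharedIO implements AutoCloseable {
    @Override
    public void close() {
      // Intentionally empty: the owner of the shared instance closes it elsewhere.
    }
  }

  // Stand-in for a cache removal callback: close only when the entry was evicted.
  static void onRemoval(AutoCloseable io, boolean wasEvicted) {
    if (!wasEvicted) {
      return; // explicit invalidation or replacement: leave the resource alone
    }
    try {
      io.close();
    } catch (Exception e) {
      System.err.println("close failed on eviction: " + e);
    }
  }

  // Creates `tables` per-table IOs plus one shared IO, evicts all of them,
  // and returns how many per-table clients are still live afterwards.
  static int liveClientsAfterEviction(int tables, boolean closeOnEviction) {
    AtomicInteger live = new AtomicInteger();
    List<AutoCloseable> cached = new ArrayList<>();
    for (int i = 0; i < tables; i++) {
      cached.add(new PerTableIO(live));
    }
    cached.add(new SharedIO());
    for (AutoCloseable io : cached) {
      if (closeOnEviction) {
        onRemoval(io, true);
      }
    }
    cached.clear(); // entries leave the cache either way
    return live.get();
  }

  public static void main(String[] args) {
    System.out.println("without close on eviction: " + liveClientsAfterEviction(300, false));
    System.out.println("with close on eviction:    " + liveClientsAfterEviction(300, true));
  }
}
```

   With this shape the cache never needs to know whether an instance is 
shared; it calls close() unconditionally on eviction and each implementation 
decides what to release.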
   
   Would be happy to test any proposed fix on a live cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

