MonkeyCanCode opened a new pull request, #14590:
URL: https://github.com/apache/iceberg/pull/14590

   # Summary
   Fix a memory leak in Spark caused by `AuthSessionCache` when using Iceberg, and 
ensure resources are cleaned up. 
   
   # Background
   I am using Spark Connect, where end users submit their Spark 
jobs/queries to a remote Spark Connect server. Query runtimes 
range from seconds to minutes, and the number of queries per user varies as 
well. In this case, the end users are the ones who create the Spark 
session and define the connection info for the Iceberg REST catalog. By default, 
the Spark Connect server cleans up idle sessions after one hour.
   
   What I found is that the memory used by the Spark Connect server is not 
garbage collected after the server kills idle sessions that have reached the 
default TTL. After some debugging, this pointed me to a leak of Apache Spark 
`ClassLoader` instances through `AuthSessionCache.java` in Apache 
Iceberg.
   
   # Changes
   1. Fix the `ClassLoader` leak in Apache Spark in `AuthSessionCache.java`
   The existing `ThreadPools.newExitingWorkerPool` call creates a 
`ScheduledExecutorService` and registers a JVM-level shutdown hook. This hook 
can inadvertently hold a strong reference to the session-specific `ClassLoader` in 
Spark Connect via the tasks it manages, which prevents it from being 
released. This change replaces `newExitingWorkerPool` with `newScheduledPool`, 
which creates a thread pool with daemon threads. Based on my understanding, 
daemon threads do not block the JVM from exiting, so the shutdown hook is no 
longer needed, which avoids the issue mentioned above.
   
   2. Ensure proper resource cleanup in catalogs
   `CachingCatalog` and `SparkCatalog` now implement `java.io.Closeable`, which 
allows them to propagate the `close` call to the underlying wrapped catalog. 
This ensures that any resources referenced by the catalogs are properly released.
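
   To illustrate the two changes together, here is a minimal sketch that uses only the JDK. It is not the code in this PR: `SketchSessionCache` and `SketchWrappingCatalog` are hypothetical stand-ins for `AuthSessionCache` and a wrapping catalog such as `CachingCatalog`, and the thread factory is written by hand rather than taken from Iceberg's `ThreadPools`. The assumption is that the background executor runs on daemon threads, registers no shutdown hook, and is released only when its owner calls `close`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a session cache that owns a background executor.
final class SketchSessionCache implements Closeable {
  private final ScheduledExecutorService executor;

  SketchSessionCache(String namePrefix) {
    // No JVM shutdown hook is registered; the owner is expected to call close().
    this.executor = Executors.newScheduledThreadPool(1, daemonThreadFactory(namePrefix));
  }

  // Builds threads flagged as daemon so they never keep the JVM alive.
  static ThreadFactory daemonThreadFactory(String namePrefix) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread thread = new Thread(runnable, namePrefix + "-" + counter.getAndIncrement());
      thread.setDaemon(true);
      return thread;
    };
  }

  boolean isClosed() {
    return executor.isShutdown();
  }

  @Override
  public void close() {
    // Stop the worker so it no longer references tasks or their class loader.
    executor.shutdownNow();
  }
}

// Hypothetical wrapper that forwards close() to whatever it wraps.
final class SketchWrappingCatalog implements Closeable {
  private final Object delegate;

  SketchWrappingCatalog(Object delegate) {
    this.delegate = delegate;
  }

  @Override
  public void close() {
    // Only delegates that hold resources need to be closed.
    if (delegate instanceof Closeable) {
      try {
        ((Closeable) delegate).close();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }
}
```

   In this sketch, closing the outer `SketchWrappingCatalog` shuts down the executor owned by the wrapped `SketchSessionCache`, which is the propagation that change 2 aims for. Without that forwarding, the inner executor would stay alive until the JVM exits, even though its threads are daemon threads.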
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

