freesinger opened a new issue, #10844:
URL: https://github.com/apache/gravitino/issues/10844

   ### Version
   
   main branch
   
   ### Describe what's wrong
   
   Gravitino-server can hit `java.lang.OutOfMemoryError: Metaspace` after running 
for a long time. Once this happens, requests may start failing with 401/500 
responses (often with an empty or incomplete body), which is misleading because 
the underlying cause is the JVM OOM.
   
   
   
   ### Error message and/or stacktrace
   
   - Lance REST client
   ```log
   {
       "error": "Unable to process: Received HTTP 500 response with empty body",
       "code": 500,
       "type": "RESTException",
       "detail": "org.apache.gravitino.exceptions.RESTException: Unable to 
process: Received HTTP 500 response with empty body\n\tat 
org.apache.gravitino.client.ErrorHandlers$RestErrorHandler.accept(ErrorHandlers.java:1333)\n\tat
 
org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:549)\n\tat
 
org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:488)\n\tat
 
   ....",
       "instance": "demo_catalog2$demo_schema"
   }
   ```
   
   
   - gravitino
   
   ```log
   2026-04-22 15:22:17.699 WARN [Gravitino-webserver-41] [org.apache.gravitino.utils.PrincipalUtils.doAs(PrincipalUtils.java:50)] - doAs method occurs an unexpected error
   java.lang.OutOfMemoryError: Metaspace
   ```
   
   ### How to reproduce
   
   When capturing `jcmd` output from the OOM'ed process, we observed many 
long-lived `org.apache.gravitino.hive.client.HiveClientClassLoader` instances. 
Each classloader retains hundreds of classes, which is consistent with 
classloader churn combined with blocked class unloading.
   
   <img width="2798" height="1798" alt="Image" 
src="https://github.com/user-attachments/assets/845d2a52-e456-4118-adc7-6f736c3497a4" />
   
   
   ### Additional context
   
   
   We suspect the issue is **Hive client pool cache miss / churn**, amplified 
by **frequent token refresh**, which repeatedly creates isolated Hive-client 
classloaders:
   
   - We use a custom cloud IAM-based Hive authenticator which fetches/refreshes 
a short-lived token and injects it into the Hive client configuration.
   - If the token (or configuration derived from it) participates in the Hive 
client pool cache key, each refresh produces a new key, causing cache misses 
and continuous creation of new `HiveClientFactory` / `HiveClientClassLoader` 
instances.
   - Even if old pools are evicted, class unloading may still be blocked by 
global/static caches, `ThreadLocal`s, shutdown hooks, etc., leading to 
Metaspace growth and eventually OOM.
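   The cache-miss mechanism above can be sketched as a minimal simulation. All 
names here (`PoolKey`, the `iam.token` property, the cache class itself) are 
hypothetical stand-ins rather than the real Gravitino classes; the point is 
only that a short-lived token inside the cache key turns every refresh into a 
miss:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Hypothetical sketch: a client pool cache keyed on the full client config.
   public class HiveClientPoolCache {
     // Cache key derived from the config; if a short-lived token is part of
     // the config, every token refresh produces a new, never-matching key.
     static final class PoolKey {
       final Map<String, String> conf;
       PoolKey(Map<String, String> conf) { this.conf = Map.copyOf(conf); }
       @Override public boolean equals(Object o) {
         return o instanceof PoolKey && conf.equals(((PoolKey) o).conf);
       }
       @Override public int hashCode() { return conf.hashCode(); }
     }
   
     final Map<PoolKey, Object> pools = new HashMap<>();
     int misses = 0;
   
     Object getOrCreate(Map<String, String> conf) {
       return pools.computeIfAbsent(new PoolKey(conf), k -> {
         misses++;
         return new Object(); // stands in for a new classloader + client pool
       });
     }
   
     public static void main(String[] args) {
       HiveClientPoolCache cache = new HiveClientPoolCache();
       Map<String, String> conf = new HashMap<>();
       conf.put("hive.metastore.uris", "thrift://metastore:9083");
       for (int i = 0; i < 100; i++) {
         conf.put("iam.token", "token-" + i); // each refresh changes the key
         cache.getOrCreate(conf);
       }
       // Every refresh misses: 100 pools (and classloaders) instead of 1.
       System.out.println("pools=" + cache.pools.size() + " misses=" + cache.misses);
     }
   }
   ```
   
   If the real key omitted (or normalized away) the volatile token, refreshes 
would hit the same entry and reuse a single classloader.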
   
   This matches the known "classloader cannot be reclaimed/unloaded" failure 
mode in the Hive/Hadoop ecosystem, which occurs when classloader-bound 
resources are not fully cleaned up.
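   As a rough illustration of that retention pattern (the `GLOBAL_CACHE` below 
is a hypothetical stand-in for any JVM-wide static structure, such as a static 
cache or `ThreadLocal` living outside the isolated loader): once a globally 
reachable structure holds a reference chain to the isolated classloader, the 
loader and all of its classes stay pinned in Metaspace even after the pool is 
evicted:
   
   ```java
   import java.lang.ref.WeakReference;
   import java.net.URL;
   import java.net.URLClassLoader;
   import java.util.HashMap;
   import java.util.Map;
   
   public class LoaderRetentionDemo {
     // Hypothetical stand-in for a JVM-wide static cache that outlives any
     // individual client pool.
     static final Map<String, Object> GLOBAL_CACHE = new HashMap<>();
   
     public static void main(String[] args) {
       // Per-pool isolated loader, analogous to HiveClientClassLoader.
       URLClassLoader loader = new URLClassLoader(new URL[0], null);
   
       // A resource created through the pool ends up in a global cache and
       // (directly or transitively) references its defining classloader.
       GLOBAL_CACHE.put("leaked-entry", loader);
   
       WeakReference<URLClassLoader> ref = new WeakReference<>(loader);
       loader = null;   // the pool is evicted; our own reference is dropped
       System.gc();
   
       // The loader is still strongly reachable through GLOBAL_CACHE, so it
       // can never be unloaded and its classes remain in Metaspace.
       System.out.println("loader collected: " + (ref.get() == null));
     }
   }
   ```
   
   Clearing such chains (removing cached entries, closing loaders, resetting 
`ThreadLocal`s) is what allows class unloading to reclaim Metaspace.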


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
