freesinger opened a new issue, #10844:
URL: https://github.com/apache/gravitino/issues/10844
### Version
main branch
### Describe what's wrong
The Gravitino server can hit `java.lang.OutOfMemoryError: Metaspace` after
running for a long time. Once that happens, requests may start failing with
401/500 responses (often with an empty or incomplete body), which is misleading
because the underlying problem is the JVM running out of Metaspace.
### Error message and/or stacktrace
- lance rest
```log
{
"error": "Unable to process: Received HTTP 500 response with empty body",
"code": 500,
"type": "RESTException",
"detail": "org.apache.gravitino.exceptions.RESTException: Unable to process: Received HTTP 500 response with empty body\n\tat org.apache.gravitino.client.ErrorHandlers$RestErrorHandler.accept(ErrorHandlers.java:1333)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:549)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:488)\n\tat ....",
"instance": "demo_catalog2$demo_schema"
}
```
- gravitino
```log
2026-04-22 15:22:17.699 WARN[Gravitino-webserver-41] [org.apache.gravitino.utils.PrincipalUtils.doAs(PrincipalUtils.java:50)]- doAs method occurs an unexpected error
java.lang.OutOfMemoryError: Metaspace
```
### How to reproduce
We have no minimal reproduction yet. In `jcmd` output captured from a process
that had already hit the OOM, we saw many long-lived
`org.apache.gravitino.hive.client.HiveClientClassLoader` instances. Each
classloader retains hundreds of classes, which is consistent with classloader
churn combined with blocked class unloading.
<img width="2798" height="1798" alt="Image"
src="https://github.com/user-attachments/assets/845d2a52-e456-4118-adc7-6f736c3497a4"
/>
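
As a complement to the `jcmd` capture, the same trend can be watched from inside the JVM. The following is a minimal sketch, not the method used for the screenshot above: it uses the standard `java.lang.management` API, the class name `MetaspaceProbe` is hypothetical, and it assumes a HotSpot JVM, where the memory pool is named `Metaspace`. A loaded-class count that keeps rising while the unloaded count stays flat would point in the same direction as the classloader list.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not part of Gravitino.
class MetaspaceProbe {

    /** Returns used Metaspace bytes, or -1 if no pool named "Metaspace" is exposed. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot naming. Other JVMs may use a different pool name.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.printf(
            "metaspace used=%d bytes, classes loaded=%d, unloaded=%d%n",
            metaspaceUsedBytes(),
            classes.getLoadedClassCount(),
            classes.getUnloadedClassCount());
    }
}
```

Logging these three numbers periodically would show whether Metaspace grows in step with token refreshes.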
### Additional context
We suspect the issue is **Hive client pool cache miss / churn**, amplified
by **frequent token refresh**, which repeatedly creates isolated Hive-client
classloaders:
- We use a custom cloud IAM-based Hive authenticator which fetches/refreshes
a short-lived token and injects it into the Hive client configuration.
- If the token (or derived config) participates in the Hive client pool
cache key, each refresh results in a new key, causing cache misses and
continuous creation of new `HiveClientFactory` / `HiveClientClassLoader`.
- Even if old pools are evicted, class unloading may still be blocked by
global/static caches, `ThreadLocal`s, shutdown hooks, etc., leading to
Metaspace growth and eventually OOM.
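
The suspected cache-key effect can be sketched in isolation. This is a simplified model under stated assumptions, not Gravitino's actual pool cache: `PoolKeySketch`, `TOKEN_PROP`, and the key-building methods are hypothetical, and a plain `Object` stands in for a `HiveClientFactory` with its `HiveClientClassLoader`. It only shows that a key including the token yields one new entry per refresh, while a key built from stable properties is reused.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical model of the suspected behavior; not Gravitino code.
class PoolKeySketch {

    // Assumed property name for the short-lived token (illustrative only).
    static final String TOKEN_PROP = "example.auth.token";

    /** Key derived from every property, so it changes on each token refresh. */
    static String keyWithToken(Map<String, String> conf) {
        return new TreeMap<>(conf).toString();
    }

    /** Key derived from stable properties only; the token is excluded. */
    static String keyWithoutToken(Map<String, String> conf) {
        Map<String, String> stable = new TreeMap<>(conf);
        stable.remove(TOKEN_PROP);
        return stable.toString();
    }

    /** Simulates token refreshes and counts how many pool stand-ins get created. */
    static int poolsCreated(Function<Map<String, String>, String> keyFn, int refreshes) {
        Map<String, Object> cache = new ConcurrentHashMap<>();
        AtomicInteger created = new AtomicInteger();
        for (int i = 0; i < refreshes; i++) {
            Map<String, String> conf = Map.of(
                "hive.metastore.uris", "thrift://example:9083",
                TOKEN_PROP, "token-" + i); // a fresh token on every refresh
            cache.computeIfAbsent(keyFn.apply(conf), k -> {
                created.incrementAndGet(); // stand-in for a new factory + classloader
                return new Object();
            });
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(poolsCreated(PoolKeySketch::keyWithToken, 5));    // prints 5
        System.out.println(poolsCreated(PoolKeySketch::keyWithoutToken, 5)); // prints 1
    }
}
```

If the real key does include the token, excluding it would only remove the churn. The refreshed token would still have to reach the existing client by some other route, and that part is not modeled here.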
This matches the “classloader cannot be reclaimed/unloaded” behavior that is
known to occur in the Hive/Hadoop ecosystem when classloader-bound resources
are not fully cleaned up.