markhoerth opened a new issue, #10839:
URL: https://github.com/apache/gravitino/issues/10839

   ### What would you like to be improved?
   
   ## Problem
   
   Gravitino does not expose standard health check endpoints. Modern Java 
services — including Apache Polaris (via Quarkus), Spring Boot applications, 
and Micronaut applications — ship liveness, readiness, and aggregate health 
endpoints out of the box. Gravitino runs on raw Jetty + Jersey and has none.
   
   This creates friction on day one of any enterprise deployment:
   
   - **Kubernetes probes.** `livenessProbe` and `readinessProbe` expect HTTP 
endpoints that distinguish "restart this pod" from "take it out of rotation." 
The current `/api/version` endpoint conflates the two and, more importantly, 
doesn't actually verify Gravitino's ability to serve traffic.
   - **Load balancers and GTMs.** Standard enterprise traffic managers poll a 
configurable health endpoint and route based on HTTP status. Gravitino has no 
such endpoint.
   - **APM tooling.** Datadog, New Relic, and similar platforms auto-discover 
health endpoints following MicroProfile or Spring conventions. Gravitino offers 
nothing for them to find.
   - **Bakeoff checklists.** Competitive evaluations mark Gravitino down on 
"operational readiness" because health endpoints are considered table stakes.
   
   ## Proposal
   
   Add three endpoints under `/api/health`, following [MicroProfile 
Health](https://microprofile.io/specifications/microprofile-health/) semantics 
(the same liveness/readiness split that Quarkus, Spring Boot Actuator, and 
Polaris expose):
   
   | Endpoint | Purpose | Healthy | Unhealthy |
   |---|---|---|---|
   | `GET /api/health/live` | Liveness — is the HTTP thread able to respond? | 
200 | (no response / timeout → K8s restarts pod) |
   | `GET /api/health/ready` | Readiness — is the entity store reachable? | 200 
| 503 (LB/GTM routes away) |
   | `GET /api/health` | Aggregate of live + ready, for consumers expecting a 
single URL | 200 | 503 |
   
   All three return a JSON body describing the individual checks:
   
   ```json
   {
     "code": 0,
     "status": "UP",
     "checks": [
       { "name": "httpServer", "status": "UP", "details": {} },
       { "name": "entityStore", "status": "UP", "details": {} }
     ]
   }
   ```
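
   As a rough illustration of how the aggregate endpoint could derive its HTTP 
status from the per-check results in the body above, here is a minimal Java 
sketch. `HealthAggregation`, `Check`, and `Status` are hypothetical names picked 
for the example and are not necessarily the classes in the patch.

   ```java
   import java.util.List;

   /** Hypothetical sketch: fold individual checks into one status and HTTP code. */
   final class HealthAggregation {
     enum Status { UP, DOWN }

     /** One entry of the "checks" array; the class name is illustrative. */
     static final class Check {
       final String name;
       final Status status;

       Check(String name, Status status) {
         this.name = name;
         this.status = status;
       }
     }

     /** Overall status is UP only if no check reports DOWN. */
     static Status overall(List<Check> checks) {
       for (Check check : checks) {
         if (check.status == Status.DOWN) {
           return Status.DOWN;
         }
       }
       return Status.UP;
     }

     /** 200 when healthy, 503 otherwise, as in the endpoint table. */
     static int httpStatus(List<Check> checks) {
       return overall(checks) == Status.UP ? 200 : 503;
     }
   }
   ```

   With this shape, `/api/health/ready` would pass only the `entityStore` check 
and `/api/health` would pass both, so a single failing check is enough to turn 
the aggregate into a 503.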
   
   ## Design notes
   
   - **Liveness does not touch downstream systems.** If the process is 
deadlocked or out of memory, the probe fails by timing out rather than by 
returning an error. This is the K8s-recommended pattern.
   - **Readiness probes the entity store** via a bounded-timeout call to 
`EntityStore.exists()` against a sentinel identifier. The call exercises the 
JDBC connection without requiring any entity to be present.
   - **Failure details are minimal by design.** We report the exception class 
name, not full stack traces, to avoid leaking internal state over an 
unauthenticated endpoint.
   - **Endpoints must be reachable without authentication** for K8s probes and 
external GTMs to work. This requires the auth filter to exempt `/api/health*`.
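
   The bounded-timeout readiness probe and the minimal failure details 
described above could look roughly like the following. This is a sketch under 
assumptions: `BoundedProbe` and `Result` are hypothetical names, and the entity 
store call is passed in as a plain `Callable<Boolean>` instead of the real 
`EntityStore` API.

   ```java
   import java.util.concurrent.Callable;
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.CompletionException;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;

   /** Hypothetical sketch of a readiness probe with a hard time limit. */
   final class BoundedProbe {
     /** Outcome of one check; detail carries only an exception class name. */
     static final class Result {
       final boolean up;
       final String detail;

       Result(boolean up, String detail) {
         this.up = up;
         this.detail = detail;
       }
     }

     /**
      * Runs storeCall on the given executor and waits at most timeoutMillis.
      * The boolean returned by storeCall is ignored: completing without an
      * exception is what counts as healthy.
      */
     static Result run(Callable<Boolean> storeCall, ExecutorService executor, long timeoutMillis) {
       CompletableFuture<Boolean> future = CompletableFuture.supplyAsync(() -> {
         try {
           return storeCall.call();
         } catch (Exception e) {
           throw new CompletionException(e);
         }
       }, executor);
       try {
         future.get(timeoutMillis, TimeUnit.MILLISECONDS);
         return new Result(true, "");
       } catch (TimeoutException e) {
         // The caller stops waiting here; the task itself may still be running
         // on the executor, which is why a dedicated executor is assumed.
         future.cancel(true);
         return new Result(false, TimeoutException.class.getSimpleName());
       } catch (ExecutionException e) {
         Throwable cause = e.getCause() != null ? e.getCause() : e;
         // Class name only: no message, no stack trace.
         return new Result(false, cause.getClass().getSimpleName());
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         return new Result(false, InterruptedException.class.getSimpleName());
       }
     }
   }
   ```

   Running the store call on a separate executor keeps the request thread from 
waiting past the limit, but it does not stop a stuck call, so the executor 
would need its own bound on threads.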
   
   ## Out of scope for this issue
   
   - Per-catalog backend health (belongs in a separate `/api/health/detailed` 
or `/api/catalogs/{name}/health` — catalog outages should not pull Gravitino 
itself out of rotation).
   - Prometheus-formatted `/metrics` endpoint (separate effort).
   - Startup probes (K8s 1.16+ — can be added if a slow-start subsystem is 
identified).
   
   ### What changes were proposed in this pull request?
   
   Adds three health check endpoints following MicroProfile Health semantics:
   
   - `GET /api/health/live`: liveness, returns 200 when the HTTP thread is 
responsive
   - `GET /api/health/ready`: readiness, returns 200 when the entity store is 
reachable, 503 otherwise
   - `GET /api/health`: aggregate, returns 200 when both checks pass, 503 
otherwise
   
   All endpoints return a JSON body describing per-check status. See the linked 
issue for response schema and rationale.
   
   ### Why are the changes needed?
   
   Modern Java services (Apache Polaris, Spring Boot, Quarkus, Micronaut) ship 
these endpoints by default. Gravitino runs on raw Jetty and does not, which 
blocks standard Kubernetes probe configuration, load balancer health checks, 
and enterprise GTM integration. This is a parity gap that surfaces on day one 
of enterprise deployments.
   
   Fixes #<ISSUE_NUMBER>
   
   ### Does this PR introduce any user-facing change?
   
   Yes — adds three new public endpoints. No existing endpoint behavior is 
changed.
   
   ### How was this patch tested?
   
   - New unit tests in `TestHealthOperations` covering:
     - Liveness always returns 200
     - Readiness returns 503 when entity store is uninitialized
     - Readiness returns 200 when entity store is reachable
     - Readiness returns 503 when entity store throws
     - Aggregate returns 200 when all checks pass
     - Aggregate returns 503 when any check fails
   - Manual verification against a running Gravitino instance: `curl -i 
http://localhost:8090/api/health/live`, `/api/health/ready`, `/api/health`.
   
   ### Notes for reviewers
   
   - **Auth filter exemption.** These endpoints must be reachable without 
credentials for K8s probes and external GTMs. If the current auth configuration 
blocks unauthenticated access to `/api/**`, a filter-level exemption for 
`/api/health*` is needed alongside this PR. Happy to add that in this PR or a 
follow-up depending on reviewer preference.
   - **Bounded timeout on entity store probe.** The readiness check runs 
`EntityStore.exists()` with a 2-second ceiling via `CompletableFuture`, so that 
a hanging JDBC connection cannot block the Jetty worker thread handling the 
probe beyond that limit.
   - **Response body format.** `HealthResponse` extends `BaseResponse` and 
keeps `code: 0` even in 503 responses: the HTTP status is the probe signal, and 
the body is diagnostic. This is intentional and differs from `ErrorResponse` 
usage.
   - **Sentinel identifier.** Readiness calls `exists("gravitino_health_probe", 
METALAKE)`. It's expected to return `false` in a healthy system; we only care 
that the call round-trips without throwing.
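
   For the auth filter exemption in the first note, one possible shape for the 
path check is sketched below. `HealthPathExemption` and `isExempt` are 
hypothetical names, and how the check would be wired into Gravitino's actual 
filter is left open. The sketch matches an explicit set of paths, on the 
assumption that a literal `/api/health*` glob would also let through unrelated 
paths such as `/api/healthcheck`.

   ```java
   import java.util.Set;

   /** Hypothetical sketch: decide whether a request path may skip authentication. */
   final class HealthPathExemption {
     // Only the three endpoints proposed in this issue are exempt.
     private static final Set<String> EXEMPT_PATHS =
         Set.of("/api/health", "/api/health/live", "/api/health/ready");

     static boolean isExempt(String requestPath) {
       return requestPath != null && EXEMPT_PATHS.contains(requestPath);
     }
   }
   ```

   An explicit set also means that a later endpoint under the same prefix stays 
behind authentication until someone deliberately adds it.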
   
   ### How should we improve?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
