markhoerth opened a new issue, #10839: URL: https://github.com/apache/gravitino/issues/10839
### What would you like to be improved?

## Problem

Gravitino does not expose standard health check endpoints. Modern Java services, including Apache Polaris (via Quarkus), Spring Boot applications, and Micronaut applications, ship liveness, readiness, and aggregate health endpoints out of the box. Gravitino runs on raw Jetty + Jersey and has none.

This creates friction on day one of any enterprise deployment:

- **Kubernetes probes.** `livenessProbe` and `readinessProbe` expect HTTP endpoints that distinguish "restart this pod" from "take it out of rotation." The current `/api/version` endpoint conflates the two and, more importantly, doesn't actually verify Gravitino's ability to serve traffic.
- **Load balancers and GTMs.** Standard enterprise traffic managers poll a configurable health endpoint and route based on HTTP status. Gravitino has no such endpoint.
- **APM tooling.** Datadog, New Relic, and similar platforms auto-discover health endpoints following MicroProfile or Spring conventions.
- **Bakeoff checklists.** Competitive evaluations mark Gravitino down on "operational readiness" because health endpoints are considered table stakes.

## Proposal

Add three endpoints under `/api/health`, following [MicroProfile Health](https://microprofile.io/specifications/microprofile-health/) semantics (the same pattern Quarkus, Spring Boot, and Polaris use):

| Endpoint | Purpose | Healthy | Unhealthy |
|---|---|---|---|
| `GET /api/health/live` | Liveness: is the HTTP thread able to respond? | 200 | (no response / timeout → K8s restarts pod) |
| `GET /api/health/ready` | Readiness: is the entity store reachable? | 200 | 503 (LB/GTM routes away) |
| `GET /api/health` | Aggregate of live + ready, for consumers expecting a single URL | 200 | 503 |

All three return a JSON body describing the individual checks:

```json
{
  "code": 0,
  "status": "UP",
  "checks": [
    { "name": "httpServer", "status": "UP", "details": {} },
    { "name": "entityStore", "status": "UP", "details": {} }
  ]
}
```

## Design notes

- **Liveness does not touch downstream systems.** If the process is deadlocked or OOM, the probe fails by not returning. This is the K8s-recommended pattern.
- **Readiness probes the entity store** via a bounded-timeout call to `EntityStore.exists()` against a sentinel identifier. The call exercises the JDBC connection without requiring any entity to be present.
- **Failure details are minimal by design.** We report the exception class name, not full stack traces, to avoid leaking internal state over an unauthenticated endpoint.
- **Endpoints must be reachable without authentication** for K8s probes and external GTMs to work. This requires the auth filter to exempt `/api/health*`.

## Out of scope for this issue

- Per-catalog backend health (belongs in a separate `/api/health/detailed` or `/api/catalogs/{name}/health`; catalog outages should not pull Gravitino itself out of rotation).
- Prometheus-formatted `/metrics` endpoint (separate effort).
- Startup probes (K8s 1.16+; can be added if a slow-start subsystem is identified).

### What changes were proposed in this pull request?

Adds three health check endpoints following MicroProfile Health semantics:

- `GET /api/health/live`: liveness, returns 200 when the HTTP thread is responsive
- `GET /api/health/ready`: readiness, returns 200 when the entity store is reachable, 503 otherwise
- `GET /api/health`: aggregate, returns 200 when both checks pass

All endpoints return a JSON body describing per-check status. See the linked issue for response schema and rationale.

### Why are the changes needed?

Modern Java services (Apache Polaris, Spring Boot, Quarkus, Micronaut) ship these endpoints by default. Gravitino runs on raw Jetty and does not, which blocks standard Kubernetes probe configuration, load balancer health checks, and enterprise GTM integration. This is a parity gap that surfaces on day one of enterprise deployments.

Fixes #<ISSUE_NUMBER>

### Does this PR introduce any user-facing change?

Yes, this adds three new public endpoints. No existing endpoint behavior is changed.

### How was this patch tested?

- New unit tests in `TestHealthOperations` covering:
  - Liveness always returns 200
  - Readiness returns 503 when entity store is uninitialized
  - Readiness returns 200 when entity store is reachable
  - Readiness returns 503 when entity store throws
  - Aggregate returns 200 when all checks pass
  - Aggregate returns 503 when any check fails
- Manual verification against a running Gravitino instance: `curl -i http://localhost:8090/api/health/live`, `/api/health/ready`, `/api/health`.

### Notes for reviewers

- **Auth filter exemption.** These endpoints must be reachable without credentials for K8s probes and external GTMs. If the current auth configuration blocks unauthenticated access to `/api/**`, a filter-level exemption for `/api/health*` is needed alongside this PR. Happy to add that in this PR or a follow-up depending on reviewer preference.
- **Bounded timeout on entity store probe.** The readiness check runs `EntityStore.exists()` with a 2-second ceiling via `CompletableFuture` to prevent a hanging JDBC connection from tying up Jetty worker threads.
- **Response body format.** `HealthResponse` extends `BaseResponse` and keeps `code: 0` even in 503 responses: the HTTP status is the probe signal, the body is diagnostic. This is intentional and differs from `ErrorResponse` usage.
- **Sentinel identifier.** Readiness calls `exists("gravitino_health_probe", METALAKE)`. It's expected to return `false` in a healthy system; we only care that the call round-trips without throwing.

### How should we improve?

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
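
For reviewers who want something concrete to react to, the bounded-timeout readiness probe and the aggregate 200/503 rule described in the notes for reviewers could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in the PR: `ReadinessProbeSketch`, `StoreProbe`, `CheckResult`, and the `"error"` details key are hypothetical names invented for illustration, and `StoreProbe` only stands in for the real `EntityStore.exists()` call, whose actual signature is not shown here. The sentinel string and the idea of reporting only the exception class name come from the issue text.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative sketch only; none of these types are Gravitino's real classes. */
final class ReadinessProbeSketch {

  /** Hypothetical stand-in for the one entity-store call the probe makes. */
  interface StoreProbe {
    boolean exists(String sentinel) throws Exception;
  }

  /** Result of a single check: UP/DOWN plus deliberately minimal details. */
  static final class CheckResult {
    final String name;
    final boolean up;
    final Map<String, String> details;

    CheckResult(String name, boolean up, Map<String, String> details) {
      this.name = name;
      this.up = up;
      this.details = details;
    }
  }

  // Daemon threads dedicated to probes: a hung store call occupies one of
  // these threads, while the caller waits at most timeoutMillis.
  private static final ExecutorService PROBE_POOL =
      Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "health-probe");
        t.setDaemon(true);
        return t;
      });

  static CheckResult checkEntityStore(StoreProbe store, long timeoutMillis) {
    CompletableFuture<Boolean> call = CompletableFuture.supplyAsync(() -> {
      try {
        return store.exists("gravitino_health_probe");
      } catch (Exception e) {
        throw new CompletionException(e);
      }
    }, PROBE_POOL);
    try {
      // The boolean is ignored: only a completed round trip matters.
      call.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return new CheckResult("entityStore", true, Collections.emptyMap());
    } catch (TimeoutException e) {
      // cancel() completes the future but does not interrupt the worker thread.
      call.cancel(true);
      return down("TimeoutException");
    } catch (ExecutionException e) {
      // Report only the exception class name, never the message or stack trace.
      return down(e.getCause().getClass().getSimpleName());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return down("InterruptedException");
    }
  }

  private static CheckResult down(String errorClass) {
    return new CheckResult(
        "entityStore", false, Collections.singletonMap("error", errorClass));
  }

  /** Aggregate rule: 200 only if every check is UP, otherwise 503. */
  static int httpStatus(CheckResult... checks) {
    for (CheckResult check : checks) {
      if (!check.up) {
        return 503;
      }
    }
    return 200;
  }
}
```

With the 2-second ceiling from the notes above, a call would look like `ReadinessProbeSketch.checkEntityStore(probe, 2000)`, and the resource method would map the result through `httpStatus(...)`. One design caveat in this sketch: because `CompletableFuture.cancel` does not interrupt the running task, a store call that never returns keeps its probe thread, so a real implementation would probably want a bounded pool or a driver-level query timeout as well.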
