Hi all,
As Gravitino evolves toward distributed multi-node deployment, I'd like
to open a discussion on how to handle cache consistency across nodes.
*Background*
Gravitino currently uses local Caffeine caches for metadata. In a
single-node setup this works well, but in a multi-node active-active
deployment, a write on Node A leaves Node B's cache stale.
With no invalidation mechanism, the only bound on staleness is the cache
TTL — which can be tens of seconds or more.
*Goals*
- Convergence within ≤ 2 seconds for metadata caches under normal
conditions.
- Stricter handling for authorization caches (roles, privileges) where
staleness is a security concern.
- No mandatory new external dependencies (H2/MySQL/PostgreSQL all
supported; Apache project principle).
*Proposed Approach*
After surveying how similar systems handle this (Keycloak, Project
Nessie, Trino, Etcd, HMS, and others), I am proposing a three-layer design:
┌────────────┬──────────────────────────────────┬──────────────────────────────────────────────┐
│ Layer      │ Mechanism                        │ Role                                         │
├────────────┼──────────────────────────────────┼──────────────────────────────────────────────┤
│ Fast path  │ HTTP fan-out on write            │ Near-instant invalidation for online nodes   │
├────────────┼──────────────────────────────────┼──────────────────────────────────────────────┤
│ Safety net │ DB global version polling (~1 s) │ Catch-up for restarted or temporarily        │
│            │                                  │ offline nodes                                │
├────────────┼──────────────────────────────────┼──────────────────────────────────────────────┤
│ Hard bound │ Short TTL on all cache entries   │ Unconditional worst-case staleness cap       │
└────────────┴──────────────────────────────────┴──────────────────────────────────────────────┘
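To make the interplay between the safety net (layer 2) and the hard bound (layer 3) concrete, here is a minimal sketch. All class, field, and method names are illustrative assumptions for this email, not existing Gravitino code, and a plain map stands in for Caffeine so the example is self-contained:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only (not Gravitino code): combines layer 2
// (DB global-version polling) with layer 3 (short TTL per entry).
// A background poller would call checkGlobalVersion(...) roughly
// every second with the version read from the shared database.
public class VersionedCache<K, V> {
  private record Entry<V>(V value, long version, long expiresAtMillis) {}

  private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private volatile long knownGlobalVersion;

  public VersionedCache(long ttlMillis, long initialVersion) {
    this.ttlMillis = ttlMillis;
    this.knownGlobalVersion = initialVersion;
  }

  public void put(K key, V value) {
    map.put(key, new Entry<>(value, knownGlobalVersion,
        System.currentTimeMillis() + ttlMillis));
  }

  // Returns null on miss: either the entry's TTL expired (hard bound)
  // or it was cached before the last observed global-version bump.
  public V get(K key) {
    Entry<V> e = map.get(key);
    if (e == null) return null;
    if (System.currentTimeMillis() > e.expiresAtMillis
        || e.version < knownGlobalVersion) {
      map.remove(key);
      return null;
    }
    return e.value;
  }

  // Safety-net poll: once the DB version moves forward, everything
  // cached under an older version becomes invisible and is re-fetched.
  public void checkGlobalVersion(long versionFromDb) {
    if (versionFromDb > knownGlobalVersion) {
      knownGlobalVersion = versionFromDb;
    }
  }
}
```

The point of the sketch is that a node which missed the HTTP fan-out (restart, network partition) still converges within one polling interval, and even if polling itself fails, the TTL caps staleness unconditionally.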
This requires zero new infrastructure. The design also defines a
CacheInvalidationTransport SPI for deployments that want stronger delivery
guarantees — with JGroups (embedded, AP model, no
external service) as the recommended optional implementation, and Redis
Pub/Sub/Streams as an alternative for deployments already running Redis.
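For discussion purposes, the SPI could be roughly the following shape. The method names and the in-process stand-in below are my assumptions, not the actual proposal; the full document has the real interface:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical shape of the proposed CacheInvalidationTransport SPI
// (method names are illustrative, not from the design doc).
interface CacheInvalidationTransport extends AutoCloseable {
  // Broadcast an invalidation event for a metadata object to peer nodes.
  void publish(String metadataIdent);

  // Register a callback invoked when a peer broadcasts an invalidation.
  void subscribe(Consumer<String> onInvalidation);

  @Override
  default void close() {}
}

// In-process stand-in used here only to show the contract; real
// implementations would be HTTP fan-out, JGroups, or Redis Pub/Sub.
class LocalTransport implements CacheInvalidationTransport {
  private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

  @Override
  public void publish(String metadataIdent) {
    listeners.forEach(l -> l.accept(metadataIdent));
  }

  @Override
  public void subscribe(Consumer<String> onInvalidation) {
    listeners.add(onInvalidation);
  }
}
```

Operators would then pick a transport via configuration, while the default HTTP fan-out keeps the zero-dependency property.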
The full analysis — including option comparisons, architecture diagrams,
industry references, and a phased implementation plan — is available here:
https://docs.google.com/document/d/18BI8dlcrs9nIHF1l7zH17L1YACA1erziBGkzWgWcK5k/edit?tab=t.0
I'd appreciate the community's input on the following open questions:
1. *Overall approach* — Does the three-layer design (fast-path
invalidation + DB version polling as safety net + short TTL as hard bound)
look like the right direction? Are there failure modes or
deployment scenarios we have not considered?
2. *External middleware policy* — The current proposal avoids mandatory
external dependencies (no Redis, ZooKeeper, Etcd, etc.) to keep Gravitino
self-contained. Should we reconsider this? Some
production deployments likely already run Redis or an Etcd cluster;
allowing (or recommending) an external message bus could simplify the
implementation considerably.
3. *Strong consistency requirement* — The proposal targets eventual
consistency with a short convergence window (≤ 2 s). Is that acceptable for
all cache classes? In particular, should
authorization caches (roles, privileges, ownership) be held to a stricter
standard — e.g., read-through on every request with no local caching — even
at the cost of higher DB load?
4. *Pluggable strategy* — Should Gravitino expose a
CacheInvalidationTransport SPI so that operators can choose their own
consistency mechanism (HTTP fan-out, JGroups, Redis, etc.) rather than the
project committing to a single built-in solution? Or does that add too much
complexity for users?
5. *Anything we missed* — Are there constraints, use cases, or prior art
in the Gravitino ecosystem that should influence this design?
Looking forward to the discussion.
Thanks,
Qi Yu