bharos commented on code in PR #10696:
URL: https://github.com/apache/gravitino/pull/10696#discussion_r3089450297
##########
design-docs/cache-improvement-design.md:
##########
@@ -0,0 +1,1421 @@
+# Gravitino Cache Improvement Design
+
+---
+
+## 1. Background
+
+### 1.1 System Overview
+
+Gravitino is a unified metadata management control plane. Compute engines
(Spark, Flink, Trino)
+call it during query planning to resolve catalog, schema, and table metadata,
and to verify
+user permissions. The access pattern is distinctly **read-heavy,
write-light**: DDL operations
+are infrequent, and metadata is resolved once per job.
+
+Gravitino is evolving from single-node to multi-node active-active HA
deployment. Each node
+currently maintains its own independent in-process Caffeine cache with no
cross-node
+synchronisation. Under HA, any write on one node leaves other nodes' caches
stale until TTL
+expiry.
+
+---
+
+### 1.2 Current Cache Architecture Overview
+
+Gravitino maintains three distinct caching layers for the authorization path:
+
+```
+┌──────────────────────────────────────────────────────┐
+│ Layer 3: Per-request cache (AuthorizationRequestContext) │
+│ Scope: one HTTP request; prevents duplicate auth calls │
+├──────────────────────────────────────────────────────┤
+│ Layer 2: Auth policy caches (JcasbinAuthorizer) │
+│ loadedRoles Cache<Long, Boolean> hook update/TTL │
+│ ownerRel Cache<Long, Optional<Long>> hook update/TTL │
+├──────────────────────────────────────────────────────┤
+│ Layer 1: Entity store cache (RelationalEntityStore) │
+│ CaffeineEntityCache — or NoOpsCache when disabled │
+│ Caches entity reads and relation queries for all modules │
+│ Controlled by Configs.CACHE_ENABLED │
+└──────────────────────────────────────────────────────┘
+```
+
+**JCasbin is the core of the auth cache system.** It maintains an in-memory
policy table:
+
+```
+(roleId, objectType, metadataId, privilege) → ALLOW | DENY
+```
+
+The Layer 2 caches exist solely to manage JCasbin's policy loading lifecycle:
+
+| Cache | Role |
+|-----------------------------------------|----------------------------------------------------------------------------------------------------------------|
+| `loadedRoles: Cache<Long, Boolean>` | Tracks which roles are already loaded into JCasbin — prevents repeated [C2]+[C3] queries on every auth request |
+| `ownerRel: Cache<Long, Optional<Long>>` | Caches owner lookups — **prevents [D1] on every auth request** (2–4 `isOwner()` calls per request, see §1.3.2) |
+
+Without `loadedRoles`, every auth request would re-execute N DB queries to
reload all of a
+user's role policies into JCasbin. These two caches are the reason the auth
path is fast on
+the warm path. Layer 1 (entity cache) additionally accelerates the name→ID
resolution calls
+([A], [B], [C1]) that feed into JCasbin's enforce call.
+
+---
+
+#### 1.2.1 Problems with the Current Entity Cache
+
+**The entity cache (Layer 1) has accumulated significant complexity and is not
well-suited to
+serve as a general-purpose or auth-dedicated caching layer.**
+
+##### Mixed responsibilities make it hard to maintain
+
+`CaffeineEntityCache` uses a single `Cache<EntityCacheRelationKey,
List<Entity>>` to store
+three semantically different kinds of data:
+
+| Stored data | Key form | Example relation types |
+|-------------------------|-------------------------------------------|-----------------------------------------------------|
+| Direct entity | `(nameIdentifier, entityType, null)` | any entity: catalog, schema, table, user, role, ... |
+| Relation result set | `(nameIdentifier, entityType, relType)` | `ROLE_USER_REL`, `TAG_METADATA_OBJECT_REL`, ... |
+| Reverse index entries | `ReverseIndexCache` (separate radix tree) | entity → list of cache keys that reference it |
+
+On top of this, a `cacheIndex` (radix tree) keeps a prefix-indexed view of all
keys to
+support cascading invalidation. The resulting invalidation logic
(`invalidateEntities`) is a
+BFS traversal that walks both the forward index and the reverse index, making
it difficult to
+reason about correctness and hard to extend safely.
+
+The five relation types currently tracked (`METADATA_OBJECT_ROLE_REL`,
`ROLE_USER_REL`,
+`ROLE_GROUP_REL`, `POLICY_METADATA_OBJECT_REL`, `TAG_METADATA_OBJECT_REL`) are
all
+auth-related, which reflects the original design intent: **the entity cache
was built
+primarily to serve the auth path.** Over time it accumulated relation types
and reverse-index
+logic without a clear ownership model, making it harder to maintain and evolve.
+
+##### Limited benefit for non-auth interfaces
+
+For general metadata API calls (list catalogs, list schemas, list tables), the
entity cache
+provides minimal benefit:
+
+| Operation | Goes through cache? | Notes |
+|------------------------------------|---------------------|--------------------------------------------------|
+| `list(namespace, type)` | **No** | Bypasses cache entirely; always hits DB |
+| `get(ident, type)` (single entity) | Yes | Cache helps on repeated reads of the same entity |
+| `update(ident, type)` | Invalidate only | Invalidates entry, write always goes to DB |
+| `listEntitiesByRelation(...)` | Yes | Only for the five auth-centric relation types |
+
+In practice, the most common metadata browsing operations (`LIST` endpoints)
are not cached
+at the entity store level. The cache's real workload is the auth path, where
the same user
+entity, role assignments, and resource IDs are resolved on every single
authorization check.
+
+**Conclusion:** The entity cache is a de-facto auth cache dressed up as a
general-purpose
+cache. Its complexity is unjustified for the non-auth use case, and its
TTL-based consistency
+model is insufficient for the auth use case (see §1.8). A purpose-built auth
cache layer —
+separate from the entity store — is the cleaner path forward.
+
+---
+
+### 1.3 JCasbin Authorization — Deep Dive
+
+#### 1.3.1 Call Graph for a Single `authorize()` Check
+
+```
+JcasbinAuthorizer.authorize(principal, metalake, metadataObject, privilege)
+│
+├─ [A] getUserEntity(username, metalake)
+│ entityStore.get(USER by NameIdentifier)
+│ → Needed to obtain integer userId for JCasbin enforce()
+│
+├─ [B] MetadataIdConverter.getID(metadataObject, metalake) ← TARGET RESOURCE
+│ entityStore.get(entity by NameIdentifier)
+│ → Needed to get integer metadataId for JCasbin enforce()
+│ → Called on every auth request
+│
+├─ [C] loadRolePrivilege(metalake, username, userId, requestContext)
+│ │ (guarded by requestContext.hasLoadRole — runs once per HTTP request)
+│ │
+│ ├─ [C1] entityStore.listEntitiesByRelation(ROLE_USER_REL, userIdentifier)
+│ │ → Get all roles assigned to this user
+│ │
+│ └─ For each role NOT already in loadedRoles cache:
+│ ├─ [C2] entityStore.get(RoleEntity by name) ← async, thread pool
+│ └─ loadPolicyByRoleEntity(roleEntity)
+│ └─ For each securableObject in role.securableObjects():
+│ ├─ [C3] MetadataIdConverter.getID(securableObject, metalake)
+│ └─ enforcer.addPolicy(roleId, objType, metadataId, privilege, effect)
+│
+│ loadedRoles.put(roleId, true) ← mark role as loaded
+│
+├─ [D] isOwner() / loadOwnerPolicy(...) ← called on EVERY auth request (not only OWNER
+│ │ privilege checks). Nearly all auth expressions contain ANY(OWNER, METALAKE, CATALOG),
+│ │ which expands to METALAKE::OWNER || CATALOG::OWNER || … and calls isOwner() directly
+│ │ via OGNL, independently of the authorize() path. Typical call count: 2–4 per request.
+│ ├─ Check ownerRel cache → if HIT, return (most non-owner users get Optional.empty())
+│ └─ [D1] entityStore.listEntitiesByRelation(OWNER_REL, ...)
+│ ownerRel.put(metadataId, Optional.of(ownerId))
+│
+└─ [E] enforcer.enforce(userId, objectType, metadataId, privilege) ← in-memory, O(1)
+```
+
+#### 1.3.2 What Each Cache Protects
+
+`loadedRoles: Cache<Long, Boolean>` — answers "is this role's policy already
in JCasbin?"
+Without it, every request re-executes [C2]+[C3] for all roles the user has
(N+1 queries).
+With it, [C2]+[C3] only run on first load per role. **This is the most
critical cache.**
+
+`ownerRel: Cache<Long, Optional<Long>>` — caches ownership lookups for
OWNER-privilege
+checks. **Contrary to initial analysis, `ownerRel` is consulted on virtually
every auth
+request**, not only when `privilege == OWNER`. The reason is that nearly every
authorization
+expression in `AuthorizationExpressionConstants` includes `ANY(OWNER,
METALAKE, CATALOG)`
+or similar clauses (e.g. `LOAD_TABLE_AUTHORIZATION_EXPRESSION`,
+`FILTER_TABLE_AUTHORIZATION_EXPRESSION`,
`LOAD_CATALOG_AUTHORIZATION_EXPRESSION`). The
+`ANY(OWNER, …)` macro expands to `METALAKE::OWNER || CATALOG::OWNER || …`, and
each
+`X::OWNER` term calls `isOwner()` directly — a code path that is **independent
of
+`authorize()`**. As a result, every auth request triggers 2–4 `isOwner()`
calls (one per
+ancestor level), each consulting `ownerRel`. For most non-owner users,
`ownerRel` caches
+`Optional.empty()`, which lets the ownership sub-check fail quickly without a
DB query.
+Without `ownerRel`, every auth request would add 2–4 extra DB queries against
`owner_meta`.
+
+**What these caches do NOT protect** (hit DB on every auth request without
entity cache):
+
+| Call | Description | Protected by |
+|----------------------------------------------|--------------------------------------------|-------------------|
+| [A] `getUserEntity()` | Fetch User entity → get integer userId | Entity cache only |
+| [B] `MetadataIdConverter.getID()` target | Resolve target resource name → integer ID | Entity cache only |
+| [C1] `listEntitiesByRelation(ROLE_USER_REL)` | Get user's role list | Entity cache only |
+
+---
+
+### 1.4 Impact of Disabling Entity Cache
+
+Layer 2 sits **on top of** Layer 1. When Layer 1 is disabled (NoOpsCache),
calls [A], [B],
+[C1] hit DB on every auth request.
+
+| Call | With entity cache | Without entity cache |
+|--------------------------------------------------|-------------------------------|---------------------------------|
+| [A] `getUserEntity()` | Cache hit after first request | **DB query every auth request** |
+| [B] `MetadataIdConverter.getID()` target | Cache hit after first request | **DB query every auth request** |
+| [C1] `listEntitiesByRelation(ROLE_USER_REL)` | Cache hit after first request | **DB query every auth request** |
+| [C2] `entityStore.get(RoleEntity)` | Protected by `loadedRoles` | DB only on cold role load |
+| [C3] `MetadataIdConverter.getID()` per privilege | Protected by `loadedRoles` | DB only on cold role load |
+| [D1] `listEntitiesByRelation(OWNER_REL)` | Protected by `ownerRel` | **DB query 2–4x per request** |
+
+---
+
+
+## 2. Goals
+
+### 2.1 The Two Problems to Solve
+
+**Problem 1 — Performance:** With entity cache disabled, [A] and [C1] hit DB
on every auth
+request. The new auth cache layer must protect these without relying on entity
store cache.
+([B] also hits DB, but this is correct and acceptable — see §1.5.)
+
+**Problem 2 — Consistency:** `loadedRoles` is TTL-bounded (up to 1 hour of staleness), and
+its invalidation hooks fire only within a single instance. Permission
+changes must take effect at the next auth request, not after TTL expiry.
+
+Both problems are solved by the same mechanism: a version-validated cache for
the user's role
+list (userId comes for free from the same query).
+
+### 2.2 Requirements
+
+| Goal | Requirement |
+|---------------------------------|----------------------------------------------------------------------|
+| HA auth consistency | Privilege revocations visible on all nodes at the next auth request |
+| Auth self-sufficiency | [A] and [C1] protected without relying on entity store cache |
+| Auth performance | Hot path: ≤ 3 lightweight DB queries |
+| No new mandatory infrastructure | Solution requires only the existing DB |
+| Incremental delivery | Phase 1 independently shippable |
+
+---
+
+## 3. Industry Reference
+
+### 3.1 Apache Polaris — Per-Entity Version Tracking
+
+#### Schema
+
+All entity types (catalogs, namespaces, tables, roles, principals) share a
single `ENTITIES`
+table (single-table inheritance). The two version columns are the key fields
for caching:
+
+```sql
+ENTITIES (
+  id                    BIGINT,  -- Unique entity ID
+  catalog_id            BIGINT,  -- Owning catalog (0 for top-level entities)
+  parent_id             BIGINT,  -- Parent entity ID, forms the hierarchy tree
+  type_code             INT,     -- Entity type enum (see hierarchy below)
+  name                  VARCHAR,
+  entity_version        INT,     -- Bumped on rename / property update / drop  ← key
+  sub_type_code         INT,     -- Subtype (ICEBERG_TABLE, ICEBERG_VIEW, etc.)
+  properties            JSON,    -- User-visible properties (location, format, etc.)
+  internal_properties   JSON,    -- Internal properties (credentials, storage config, etc.)
+  grant_records_version INT,     -- Bumped on every GRANT or REVOKE            ← key
+)
+
+GRANT_RECORDS (
+  securable_catalog_id BIGINT,
+  securable_id         BIGINT,  -- The resource being secured (table/namespace/catalog)
+  grantee_catalog_id   BIGINT,
+  grantee_id           BIGINT,  -- The principal or role receiving the grant
+  privilege_code       INT      -- One of 102 defined privileges
+)
+```
+
+`GRANT_RECORDS` has no version column of its own. The version fingerprint is
stored in
+`ENTITIES.grant_records_version` — detecting staleness requires no scan of
`GRANT_RECORDS`.
+
+#### Entity Type Hierarchy
+
+```
+ROOT
+ ├── PRINCIPAL (user account, isGrantee)
+ ├── PRINCIPAL_ROLE (user-level role, isGrantee)
+ └── CATALOG
+ ├── CATALOG_ROLE (catalog-level role, isGrantee)
+ ├── NAMESPACE
+ │ └── TABLE_LIKE / POLICY / FILE
+ └── TASK
+```
+
+Only `PRINCIPAL`, `PRINCIPAL_ROLE`, and `CATALOG_ROLE` are **grantees** (can
receive grants).
+All others are **securables** (privileges are set on them).
+
+#### How `grantRecordsVersion` Is Maintained
+
+Every `grantPrivilege` / `revokePrivilege` call performs three writes in **one
DB transaction**:
+
+1. Insert or delete the `GRANT_RECORDS` row.
+2. Increment `grant_records_version` on the **grantee** entity row.
+3. Increment `grant_records_version` on the **securable** entity row.
+
+Both sides are bumped atomically — no separate changelog table is needed.
+
+#### Version-Validated Cache
+
+The cache unit is `ResolvedPolarisEntity` = entity metadata + grant records in
both directions.
+On every request, `bulkValidate()` issues one batch query for all path
entities:
+
+```sql
+SELECT * FROM ENTITIES WHERE (catalog_id, id) IN ((?, ?), ...)
+```
+
+| Path | Condition | Action |
+|-------------------------|-----------------------|----------------------------------------|
+| Cache hit | Both versions current | Serve from cache — **0 extra queries** |
+| Stale, targeted refresh | Either version behind | Reload only the changed dimension |
+| Cache miss | Not in cache | Full load |
+
+The DB is the single source of truth; no broadcast is needed for correctness.
+
+**Key difference from Gravitino:** Polaris bundles entity + grants in one
cached object, so one
+batch query covers both dimensions. Gravitino separates user→role from
role→privilege, requiring
+2 version-check queries on a warm hit (see §4.7 Step 1 and Step 3). Both
achieve strong
+consistency.
+
+### 3.2 Other References
+
+**Nessie** — HTTP fan-out invalidation: async POST to peer nodes on write,
convergence < 200 ms.
+
+**Keycloak** — JGroups embedded cluster messaging: in-JVM broadcast, no
separate service.
+Recommended future direction if Gravitino needs stronger delivery guarantees.
+
+**DB version polling** — monotonic counters incremented in write transaction;
a background
+thread polls for version changes and proactively invalidates caches.
Considered but not
+adopted; per-request validation (§4.7) achieves strong consistency without
background threads.
+
+---
+
+## 4. Design
+
+### 4.1 Design Overview
+
+Three caches drive auth performance: the user/group → role mapping, entity
name → integer ID,
+and ownership lookups. Each has different access frequency, mutation rate, and
security impact
+— and consequently a different consistency model.
+
+**Consistency tier 1 — strong (version-validated):** User-role assignments and
role-privilege
+definitions are security-critical. A revoked permission must not be served
from cache even one
+second after revocation. Each auth request issues two lightweight
version-check queries against
+`user_meta`, `group_meta`, and `role_meta`. If any `updated_at` timestamp has
advanced since
+the cached value, only the stale portion is reloaded. Staleness window:
**zero**.
+
+**Consistency tier 2 — eventual (write-path hook + change poller):** Entity
name→ID mappings
+and ownership records change far less frequently (DDL, ownership transfers)
and a brief window
+of inconsistency has lower security impact. The local node sees changes
immediately via hooks
+that fire after transaction commit. HA peer nodes converge within the change
poll interval
+(default 1 s) via two lightweight poll queries. No external infrastructure
(Kafka, Redis) is
+required — the existing DB is the single source of truth for both tiers.
+
+---
+
+### 4.1.1 Current-vs-Target Gap (Code-Aligned)
+
+The design below intentionally closes concrete gaps in the current
implementation:
+
+| Area | Current behavior (main branch) | Target behavior (this design) |
+|--------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
+| Role loading | `JcasbinAuthorizer.loadRolePrivilege()` loads only `ROLE_USER_REL` | Load both user direct roles and group-derived roles (`group_role_rel`) |
+| Owner check path | `isOwner()` calls `MetadataIdConverter.getID()` twice for the same object | Resolve metadata ID once per call path and reuse |
+| Role cache coherence | `loadedRoles: Cache<Long, Boolean>` is TTL-driven | `loadedRoles: roleId -> updated_at` with per-request version validation |
+| Cross-node entity/owner invalidation | In-process hook/TTL only, no durable HA invalidation stream | DB-backed pollers (`owner_meta.updated_at`, `entity_change_log`) with targeted invalidation |
+| Request-scope dedup | `AuthorizationRequestContext` has allow/deny result cache + `hasLoadRole` | Add request-scope owner/id dedup maps with strict request-thread scope |
+
+This section is used as implementation acceptance criteria and should stay
synchronized with code
+changes in `server-common/.../JcasbinAuthorizer.java`,
+`server-common/.../MetadataIdConverter.java`, and
`core/.../AuthorizationRequestContext.java`.
+
+---
+
+### 4.2 Strong Consistency: User, Group, and Role Caches
+
+#### Why Strong Consistency Is Required
+
+Privilege revocations are the primary security enforcement operation. If a
user's role is
+revoked or a role's privilege is removed, the change must take effect on the
**next** auth
+request on any node, not after TTL expiry. TTL-only caching is fundamentally
unable to
+provide this guarantee.
+
+The chosen approach is Polaris-style per-request version validation: each row
in `user_meta`,
+`group_meta`, and `role_meta` carries an `updated_at` timestamp set in the
same DB transaction
+as the security write. On every auth request, the authorizer fetches these
timestamps and
+compares them against cached values. A mismatch triggers a targeted reload of
only the changed
+entry — not a full policy flush.
+
+Groups are **not optional**: a user can belong to a group that itself holds
role assignments.
+`group_meta.updated_at` receives the same treatment as `user_meta.updated_at`,
so group-role
+changes are immediately reflected everywhere.
+
+Using a timestamp instead of a monotonic counter has a theoretical
same-millisecond collision
+risk (two writes within 1 ms yield the same value → cache misses the second
change), but this
+is negligible for administrative operations (GRANT/REVOKE) in practice.
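The validation flow above can be sketched as follows. `CachedUserRoles` matches the shape in §4.6; the `resolve` method and its parameters are illustrative stand-ins, not the actual `JcasbinAuthorizer` API — the point is only that the version-check query returns `(user_id, updated_at)` together, and the role list is reloaded solely on a timestamp mismatch:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of per-request version validation for the user-role cache.
// Names other than CachedUserRoles are assumptions for illustration.
final class UserRoleVersionCheck {

  record CachedUserRoles(long userId, long updatedAt, List<Long> roleIds) {}

  private final Map<String, CachedUserRoles> cache = new ConcurrentHashMap<>();

  /**
   * dbUserId/dbUpdatedAt come from the Step 1 version-check query (userId is
   * returned "for free" by the same query). roleLoader runs only when the
   * cached entry is missing or its updatedAt is behind the DB value.
   */
  CachedUserRoles resolve(
      String key, long dbUserId, long dbUpdatedAt, Supplier<List<Long>> roleLoader) {
    CachedUserRoles cached = cache.get(key);
    if (cached != null && cached.updatedAt() == dbUpdatedAt) {
      return cached; // warm hit: versions match, zero staleness window, no reload
    }
    // Miss or stale: targeted reload of this entry only, not a full flush.
    CachedUserRoles fresh = new CachedUserRoles(dbUserId, dbUpdatedAt, roleLoader.get());
    cache.put(key, fresh);
    return fresh;
  }
}
```

The same shape applies to `groupRoleCache` with `group_meta.updated_at` as the staleness sentinel.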
+
+#### Schema Changes
+
+```sql
+-- Role privilege tracking (strong consistency — Step 3 version check)
+ALTER TABLE `role_meta`
+ ADD COLUMN `updated_at` BIGINT NOT NULL DEFAULT 0
+ COMMENT 'Set to currentTimeMillis() on any privilege grant/revoke for this
role.
+ JcasbinAuthorizer compares db.updated_at vs cached updated_at per
request
+ to decide whether to reload JCasbin policies for this role.';
+
+-- User role assignment tracking (strong consistency — Step 1a version check)
+ALTER TABLE `user_meta`
+ ADD COLUMN `updated_at` BIGINT NOT NULL DEFAULT 0
+ COMMENT 'Set to currentTimeMillis() on any role assign/revoke for this
user.
+ JcasbinAuthorizer compares db.updated_at vs cached updated_at per
request
+ to decide whether to reload the user-role mapping.';
+
+-- Group role assignment tracking (strong consistency — Step 1b version check)
+ALTER TABLE `group_meta`
+ ADD COLUMN `updated_at` BIGINT NOT NULL DEFAULT 0
+ COMMENT 'Set to currentTimeMillis() on any role assign/revoke for this
group.
+ JcasbinAuthorizer compares db.updated_at vs cached updated_at per
request
+ to decide whether to reload the group-role mapping.';
+```
+
+#### Index and Backfill Notes
+
+To keep Step 1 and Step 3 checks predictable under load, add/verify covering
indexes for
+high-frequency predicates:
+
+```sql
+-- Suggested read-path indexes for version checks
+CREATE INDEX idx_user_meta_name_del_upd
+ ON user_meta (metalake_id, user_name, deleted_at, updated_at);
+CREATE INDEX idx_group_meta_del_upd
+ ON group_meta (group_id, deleted_at, updated_at);
+CREATE INDEX idx_role_meta_del_upd
+ ON role_meta (role_id, deleted_at, updated_at);
+CREATE INDEX idx_owner_meta_obj_del_upd
+ ON owner_meta (metadata_object_id, deleted_at, updated_at);
+```
+
+Backfill strategy for the newly added `updated_at` columns:
+
+1. DDL adds columns with default `0`.
+2. One-time backfill sets `updated_at = create_time` (or `last_modified_time`
if available)
+ for existing active rows.
+3. New write-path hooks become the long-term source of truth.
+
+Using explicit backfill avoids a long-lived "all zero" window that would force
unnecessary cold
+reloads at rollout time.
+
+---
+
+### 4.3 Eventual Consistency: Ownership Cache (`ownerRelCache`)
+
+#### Why `ownerRelCache` Is Critical for Performance
+
+Nearly all authorization expressions include `ANY(OWNER, METALAKE, CATALOG)` or
+`ANY(OWNER, METALAKE, CATALOG, SCHEMA, ...)`. These expand via OGNL to a chain
of
+`METALAKE::OWNER || CATALOG::OWNER || ...` calls. Each term calls `isOwner()`
**directly**,
+independent of the `authorize()` path. Every auth request triggers **2–4
`isOwner()` calls**
+(one per ancestor level). Without a cache, this adds 2–4 extra `owner_meta` DB
queries per
+request. For most non-owner users, the result is `Optional.empty()`, so the
cache primarily
+stores empty-ownership negatives that let the check fail quickly.
+
+#### Why Version-Validated Caching Is Unnecessary for Ownership
+
+| Cache | What a version check returns | What it saves |
+|--------------------------------------------------|-----------------------------------|------------------------------------------------------------------------------------|
+| `loadedRoles` | `(role_id, updated_at)` | Skips reloading all securable objects + JCasbin `addPolicy` calls — **expensive** |
+| `ownerRelCache` (hypothetical version-validated) | `(metadata_object_id, owner_id)` | Nothing — the version check query **already returns `owner_id`** |
|
+
+A version-validated `ownerRelCache` would add schema columns, write-path
version bumps, and
+per-request version queries — while saving exactly zero DB queries beyond what
the version
+check itself costs. Complexity without benefit.
+
+#### Invalidation Strategy: TTL Safety-Net + Write-Path Hook + Owner Change
Poller
+
+`ownerRelCache` uses a three-layer strategy:
+
+1. **Local node — immediate**: `handleMetadataOwnerChange()` hook fires after
the ownership
+ transfer transaction commits and calls
`ownerRelCache.invalidate(metadataId)`.
+2. **HA peer nodes — targeted, near real-time (≤ 1 s)**: the owner change
poller queries
+ `owner_meta WHERE updated_at > maxOwnerUpdatedAt`. For each returned row it
calls
+ `ownerRelCache.invalidate(metadataObjectId)` — only the changed entries are
evicted;
+ unrelated cached ownerships remain hot.
+3. **TTL — safety net only**: a long TTL (e.g. 1 hour) catches any missed
invalidation
+ (e.g. poller downtime). Correctness relies on hook + poller, not TTL.
+
+`owner_meta` is a 1:1 table (one row per entity with an owner). The poller can
read
+`updated_at` directly from the source table and immediately get the
`metadata_object_id` to
+invalidate — no intermediate log table is needed. This avoids write
amplification and keeps
+the design simple.
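One poll cycle can be sketched as below. The class and record names are hypothetical; `changes` stands in for the result of `SELECT metadata_object_id, updated_at FROM owner_meta WHERE updated_at > ?`:

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of one owner-change poll cycle (names are assumptions).
final class OwnerChangePollerSketch {

  /** One changed owner_meta row: (metadata_object_id, updated_at). */
  record OwnerChange(long metadataObjectId, long updatedAt) {}

  private long maxOwnerUpdatedAt; // watermark: highest updated_at seen so far

  OwnerChangePollerSketch(long startWatermark) {
    this.maxOwnerUpdatedAt = startWatermark;
  }

  /**
   * Evicts only the changed metadataObjectIds (unrelated cached ownerships
   * stay hot) and advances the watermark to the max updated_at seen.
   */
  void pollOnce(List<OwnerChange> changes, Consumer<Long> invalidate) {
    for (OwnerChange ch : changes) {
      invalidate.accept(ch.metadataObjectId());
      maxOwnerUpdatedAt = Math.max(maxOwnerUpdatedAt, ch.updatedAt());
    }
  }

  long watermark() {
    return maxOwnerUpdatedAt;
  }
}
```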
+
+#### Why Eventual Consistency Is Safe for Ownership
+
+Privilege revocation (GRANT/REVOKE) is handled by the **strong-consistency**
Steps 1 + 3.
+Ownership transfer is an administrative reorganisation, not an emergency
access revocation —
+a ≤ 1 s grace period on HA peer nodes is operationally acceptable and
consistent with how
+similar systems (AWS IAM, Apache Polaris) treat structural metadata changes.
+
+#### Schema Change
+
+```sql
+-- Ownership mutation tracking (eventual consistency — owner change poller)
+ALTER TABLE `owner_meta`
+ ADD COLUMN `updated_at` BIGINT NOT NULL DEFAULT 0
+ COMMENT 'Set to currentTimeMillis() on any ownership transfer.
+ The owner change poller reads updated_at > maxSeen to find
changed rows
+ and invalidates only the specific metadataObjectIds in
ownerRelCache.';
+```
+
+---
+
+### 4.4 Eventual Consistency: Name→ID Cache (`metadataIdCache`)
+
+#### The Problem: Repeated `getID()` Calls in OGNL Expression Evaluation
+
+`MetadataIdConverter.getID()` calls `entityStore.get()` for every unique
`(MetadataObject,
+privilege)` combination in the OGNL expression. The `allowAuthorizerCache`
deduplicates
+complete `(principal, metalake, obj, privilege)` results, but different
privileges on the same
+object (e.g. `METALAKE::USE_CATALOG`, `METALAKE::USE_SCHEMA`,
`METALAKE::DENY_USE_CATALOG`)
+each trigger a separate `getID(METALAKE)` call. A full
`LOAD_TABLE_AUTHORIZATION_EXPRESSION`
+evaluation can trigger **8–12 `getID()` calls**, of which most are for the
same 3–4 objects.
+
+#### Hierarchical Cache Key with Prefix-Based Cascade Invalidation
+
+The cache key uses a hierarchical `::` separator that enables prefix-based
cascade eviction:
+
+| Entity type | Key example | Is non-leaf? |
+|-------------|---------------------------------|-------------------|
+| METALAKE | `lake1::` | ✓ (trailing `::`) |
+| CATALOG | `lake1::cat1::` | ✓ |
+| SCHEMA | `lake1::cat1::s1::` | ✓ |
+| TABLE | `lake1::cat1::s1::t1::TABLE` | leaf |
+| FILESET | `lake1::cat1::s1::fs1::FILESET` | leaf |
+| TOPIC | `lake1::cat1::s1::tp1::TOPIC` | leaf |
+| MODEL | `lake1::cat1::s1::m1::MODEL` | leaf |
+| VIEW | `lake1::cat1::s1::v1::VIEW` | leaf |
+
+`invalidateByPrefix("lake1::cat1::")` evicts the catalog entry AND all
schemas, tables,
+filesets, and other entities beneath it in a single O(n) pass over the cache
(bounded, DDL is
+rare).
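A minimal sketch of the cascade eviction, with a plain map standing in for the Caffeine cache (`MetadataIdCacheSketch` and its methods are hypothetical names):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of prefix-based cascade eviction over hierarchical "::" keys.
final class MetadataIdCacheSketch {
  private final Map<String, Long> cache = new ConcurrentHashMap<>();

  void put(String key, long id) {
    cache.put(key, id);
  }

  Long get(String key) {
    return cache.get(key);
  }

  /** One O(n) pass: evicts the prefix entry itself and every descendant key. */
  void invalidateByPrefix(String prefix) {
    cache.keySet().removeIf(k -> k.startsWith(prefix));
  }
}
```

Because non-leaf keys end with a trailing `::`, the prefix match cannot accidentally evict a sibling such as `lake1::cat10::` when invalidating `lake1::cat1::`.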
+
+#### Why `entity_change_log` Instead of Adding `updated_at` to Entity Tables
+
+The natural alternative is adding `updated_at` to each entity table
(metalake_meta,
+catalog_meta, schema_meta, table_meta, …) and polling them directly. This has
three
+fundamental problems:
+
+| Problem | Explanation |
+|------------------------------------------------|-------------|
+| **Old name unavailable after rename** | Entity tables store the **current** name only. After `table1 → table2`, the row holds `table2`. The poller can detect *a* change happened but cannot reconstruct the old cache key to invalidate it. |
+| **JOIN cost to reconstruct full path** | Nested tables (schema, table, fileset, …) store only their simple name. Rebuilding the full `catalog.schema.table` path requires multi-level JOINs per entity type — eight separate queries or a complex UNION every poll cycle. |
+| **Cascade requires scanning all child tables** | Dropping `cat1` means also polling schema_meta, table_meta, … for all rows under `cat1`. With `entity_change_log`, **one row** for the catalog + `invalidateByPrefix("lake::cat1::")` evicts the entire subtree. |
+
+`entity_change_log` solves all three: `(metalake_name, entity_type,
full_name)` gives the
+poller exactly what it needs to call `buildCacheKey + invalidateByPrefix`, and
`operate_type`
+documents the nature of the change for observability.
+
+**Rename correctness:** Rename does not change `metadataId`. The write path logs
+`operate_type=ALTER, full_name=oldName`. The poller calls
+`invalidateByPrefix(buildCacheKey(lake, CATALOG, "cat1"))` → evicts `lake::cat1::` and all
+children. New name keys are cold misses → DB → same numeric ids. Policy in JCasbin is
+unaffected (keyed on numeric ids, not names).
+
+**Per-request dedup:** A `Map<String, Long>` in `AuthorizationRequestContext`
provides an
+additional within-request dedup layer. When the Caffeine cache is cold, the
first `getID()`
+call for a given object populates the request-level map; subsequent calls
within the same
+request avoid repeated Caffeine lookups. On the warm path (Caffeine hits), the
request-level
+map is a minor CPU optimisation.
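The dedup layer is small enough to sketch in full; the class and method names below are assumptions, not the actual `AuthorizationRequestContext` API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a request-scope dedup map for getID() results (names assumed).
final class RequestScopeIdDedup {
  // Lives for one HTTP request on the request thread — no synchronisation needed.
  private final Map<String, Long> resolvedIds = new HashMap<>();

  /**
   * resolver stands in for the shared Caffeine cache / DB lookup; within one
   * request it runs at most once per cache key.
   */
  long getOrResolve(String cacheKey, Function<String, Long> resolver) {
    return resolvedIds.computeIfAbsent(cacheKey, resolver);
  }
}
```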
+
+**Write amplification:** one row per affected entity per operation — never per
child. Entity
+DDL is rare in production. Rows are pruned after a configurable retention
window (default 1
+hour).
+
+#### Invalidation Strategy: Persistent Caffeine Cache + Write-Path Hook +
Entity Change Log Poller
+
+`metadataIdCache` uses a three-layer strategy:
+
+1. **Local node — immediate**: `handleEntityStructuralChange()` hook fires
after transaction
+ commit and calls `metadataIdCache.invalidateByPrefix(buildCacheKey(...))`.
Non-leaf entities
+ cascade to all children; leaf entities match exactly one entry.
+2. **HA peer nodes — cascade, near real-time (≤ 1 s)**: the entity change
poller reads
+ `entity_change_log WHERE created_at > maxEntityCreatedAt`, rebuilds the
cache key from
+ `(metalake_name, entity_type, full_name)`, and calls `invalidateByPrefix` —
**one log row
+ per DROP/ALTER operation regardless of how many children exist**.
+3. **TTL — safety net only**: a long TTL (e.g. 1 hour) as a last resort.
+
+#### Schema Change
+
+```sql
+-- Entity name→id mutation tracking (eventual consistency — entity change
poller)
+CREATE TABLE `entity_change_log` (
+ `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
+ `metalake_name` VARCHAR(128) NOT NULL,
+ `entity_type` VARCHAR(32) NOT NULL
+ COMMENT 'METALAKE | CATALOG | SCHEMA | TABLE | FILESET | TOPIC | MODEL |
VIEW',
+ `full_name` VARCHAR(512) NOT NULL
+ COMMENT 'Dot-separated full name of the affected entity. For RENAME,
stores the
+ OLD name (the stale key to invalidate). For DROP/ALTER, the
entity name.',
+ `operate_type` VARCHAR(16) NOT NULL
+ COMMENT 'DROP | CREATE | ALTER (ALTER covers rename and other structural
changes)',
+ `created_at` BIGINT NOT NULL,
+ PRIMARY KEY (`id`),
+ INDEX `idx_created_at` (`created_at`)
+) COMMENT 'Append-only log of entity structural changes.
+ One row per affected entity per operation. The entity change poller
reads
+ this table to drive targeted invalidation of metadataIdCache on HA
peer nodes.
+ Rows older than the retention window (default 1 h) are pruned
periodically.';
+```
+
+---
+
+### 4.5 Write Path Invariants
+
+All schema tracking writes must execute **in the same DB transaction** as the
data change.
+If the transaction rolls back, none of the tracking writes are committed — no
spurious cache
+invalidations on HA peers.
+
+| Operation | Schema write | Location |
+|-------------------------------------|--------------|----------|
+| Grant / revoke privilege on role R | `role_meta.updated_at = now() WHERE role_id = R` | `RoleMetaService` |
+| Assign / revoke role for user U | `user_meta.updated_at = now() WHERE user_id = U` | `UserMetaService` |
+| Assign / revoke role for group G | `group_meta.updated_at = now() WHERE group_id = G` | `GroupMetaService` |
+| Ownership transfer for entity E | `owner_meta.updated_at = now() WHERE metadata_object_id = E` | `OwnerMetaService` |
+| Rename entity (old name → new name) | INSERT into `entity_change_log`: `operate_type=ALTER`, **old** `full_name`, `created_at=now()` | All entity MetaService classes |
+| Drop entity with name N | INSERT into `entity_change_log`: `operate_type=DROP`, `full_name=N`, `created_at=now()` | All entity MetaService classes |
+| Create entity with name N | INSERT into `entity_change_log`: `operate_type=CREATE`, `full_name=N`, `created_at=now()` (optional — new names are cache misses; insert is a safety net only) | All entity MetaService classes |
+
+---
+
+### 4.6 Cache Data Structures (Changes in JcasbinAuthorizer)
+
+```java
+// ─── BEFORE ──────────────────────────────────────────────────────────
+private Cache<Long, Boolean> loadedRoles; // roleId → loaded?
+private Cache<Long, Optional<Long>> ownerRel;
+
+// ─── AFTER ───────────────────────────────────────────────────────────
+
+// NEW: replaces entity cache dependency for [A] (userId) and [C1] (role list).
+// Step 1 query returns both user_id and updated_at in one shot.
+// metalakeName→metalakeId resolved inline via JOIN — no dedicated cache needed.
+private GravitinoCache<String, CachedUserRoles> userRoleCache;
+// key = metalakeName + ":" + userName
+
+record CachedUserRoles(
+ long userId, // integer userId for JCasbin enforce()
+ long updatedAt, // user_meta.updated_at at load time — staleness sentinel
+ List<Long> roleIds // role ID list at load time
+) {}
+
+// NEW: mirrors userRoleCache for groups (a group can also hold role assignments).
+private GravitinoCache<String, CachedGroupRoles> groupRoleCache;
+// key = metalakeName + ":" + groupName
+
+record CachedGroupRoles(
+ long groupId,
+ long updatedAt, // group_meta.updated_at at load time — staleness sentinel
+ List<Long> roleIds
+) {}
+
+// TYPE CHANGE: was Cache<Long, Boolean>, now stores role_meta.updated_at.
+// Enables staleness detection rather than TTL expiry.
+private GravitinoCache<Long, Long> loadedRoles;
+// roleId → role_meta.updated_at at the time JCasbin policies were loaded
+
+// NEW: caches name → integer id for every MetadataObject referenced in OGNL expressions.
+// Without this, every authorize()/isOwner() call triggers entityStore.get() for each unique
+// object in the expression chain (METALAKE, CATALOG, SCHEMA, TABLE etc.), multiplied by the
+// number of distinct privilege checks on that object.
+// Consistency: immediate on local node via handleEntityStructuralChange() hook on drop/rename;
+// HA peer nodes: entity change poller reads entity_change_log WHERE created_at > maxSeen
Review Comment:
Nit: entity_change_log already has an auto-increment id PK. Consider using
WHERE id > lastSeenId as the poller cursor instead of WHERE created_at >
createdAtAfter. It's a PK scan (faster than a secondary index scan on
created_at), and the idx_created_at index can then be kept only for pruning.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]