jerryshao commented on PR #10696:
URL: https://github.com/apache/gravitino/pull/10696#issuecomment-4243040273
## Design Review Feedback
Thanks for the detailed design document! The version-validated cache
approach is architecturally sound and the Polaris-inspired analysis is
excellent. Here's design-level feedback before implementation begins:
---
### 🔴 Must Fix
**#1: Version column duplication violates existing schema patterns**
The proposed `role_meta.securable_objects_version`,
`user_meta.role_grants_version`, etc. duplicate functionality already provided
by the existing `current_version` / `last_version` columns present on **all 20+
metadata tables**. Adding a third versioning scheme creates confusion about
which column to increment on a given mutation and creates migration complexity.
Recommend either reusing `current_version` (increment on any mutation) or
following the `table_column_version_info` / `function_version_info` pattern to
add a dedicated version event table.
**#2: `GravitinoCache` interface is missing lock primitives → race
conditions**
The proposed `GravitinoCache<K,V>` interface has no locking methods. The
existing `EntityCache` interface defines three critical lock primitives
(`withCacheLock`, `withMultipleKeyCacheLock`) backed by `SegmentedLock`. The
multi-step read-check-update in the auth flow (§4.1.3 Step 1) has a classic
race condition without them:
```
Thread A: reads version=5, cache has version=4 → starts reload
Thread B: reads version=5, cache has version=4 → starts reload (duplicate
query, competing write)
```
`GravitinoCache` must expose the same locking primitives as `EntityCache`.
**#3: Group role loading is a new feature, not a caching optimization —
scope creep**
§4.1.2 introduces `groupRoleCache` and §4.1.3 Step 1b adds group-based role
loading. The doc itself notes: *"current code only loads user-direct roles
(ROLE_USER_REL). Loading group roles via [1b] is a NEW capability."*
This changes the authorization permission model (users in groups now inherit
group roles), which has a different risk profile, requires separate security
testing, and affects audit compliance. Strongly recommend splitting:
- **Phase 0 (new PR):** Add group role inheritance using existing TTL-based
caching
- **Phase 2 (this PR):** Migrate all caches to version-validated mechanism
---
### 🟡 Should Fix
**#4: Multi-entity version bump scenarios are underdefined**
The design shows version bumps for single-entity mutations but doesn't
address complex cross-entity operations: ownership transfer, adding a user to a
group with existing roles, privilege grant cascades. Does `user_meta` need a
version bump when the user's group gains a new role? Add a decision table
covering all mutation scenarios.
**#5: "Lightweight" query claims are unquantified**
Step 1a involves a JOIN across `metalake_meta` + `user_meta` — that's two
index scans, not a trivial lookup. The Phase 2 hot path has 4 required DB
queries per auth request. Please add benchmarking results (or commit to
profiling before Phase 2 implementation) to validate the performance claims.
**#6: No degraded mode strategy for DB unavailability**
Phase 2 makes every auth request require live DB connectivity, removing the
current passive resilience (today's entity cache can serve from memory during
short DB outages within the TTL window). The statement *"DB unavailable → auth
blocked (same as today)"* is inaccurate. Please add a §4.2 covering the chosen
approach: stale-on-error, circuit breaker, or explicit acceptance that DB
becomes a single point of failure for auth.
**#7: `ownerRel` replacement — implementation details missing**
The proposal to replace `ownerRel` with a request-scoped cache in
`AuthorizationRequestContext` is sound, but the design omits: context lifecycle
(ThreadLocal vs. request bean), cache key structure, and thread-safety for
parallel auth checks. Please document the concrete
`AuthorizationRequestContext` changes. Also, the existing `getID()` double-call
bug (lines 224/228) should be fixed as a **separate commit** rather than
bundled into Phase 2.
**#8: MyBatis mapper return type is semantically incorrect**
`Map<Long, Integer> batchGetSecurableObjectsVersions(List<Long> roleIds)`
with `resultType="map"` won't work as intended — MyBatis returns
`List<Map<String, Object>>` for that configuration, not a flattened `Map<Long,
Integer>`. Recommend using a dedicated VO (`record RoleVersion(long roleId, int
version) {}`) with a proper `<resultMap>`.
---
### ✅ Strong Points
- Polaris reference analysis (§3.1) is excellent industry research
- TOCTOU/race condition analysis (§4.1.6) is thorough
- Phase 1/2 split is well-reasoned and de-risks schema migration
- `ownerRel` decision log (§7.2) shows strong self-review
---
Overall the direction is right. The three 🔴 issues — especially scope creep
on group roles and missing locking in the cache interface — should be resolved
before moving to implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]