jerryshao commented on PR #10696:
URL: https://github.com/apache/gravitino/pull/10696#issuecomment-4243320011

   > ## Design Review Feedback
   > Thanks for the detailed design document! The version-validated cache 
approach is architecturally sound and the Polaris-inspired analysis is 
excellent. Here's design-level feedback before implementation begins:
   > 
   > ### 🔴 Must Fix
   > **#1: Version column duplication violates existing schema patterns** The 
proposed `role_meta.securable_objects_version`, 
`user_meta.role_grants_version`, etc. duplicate functionality already provided 
by the existing `current_version` / `last_version` columns present on **all 20+ 
metadata tables**. Adding a third versioning scheme creates confusion about 
which column to increment on a given mutation and creates migration complexity. 
Recommend either reusing `current_version` (increment on any mutation) or 
following the `table_column_version_info` / `function_version_info` pattern to 
add a dedicated version event table.
   > 
   > **#2: `GravitinoCache` interface is missing lock primitives → race 
conditions** The proposed `GravitinoCache<K,V>` interface has no locking 
methods. The existing `EntityCache` interface defines three critical lock 
primitives (`withCacheLock`, `withMultipleKeyCacheLock`) backed by 
`SegmentedLock`. The multi-step read-check-update in the auth flow (§4.1.3 Step 
1) has a classic race condition without them:
   > 
   > ```
   > Thread A: reads version=5, cache has version=4 → starts reload
   > Thread B: reads version=5, cache has version=4 → starts reload (duplicate 
query, competing write)
   > ```
   > 
   > `GravitinoCache` must expose the same locking primitives as `EntityCache`.
   > 
   > **#3: Group role loading is a new feature, not a caching optimization — 
scope creep** §4.1.2 introduces `groupRoleCache` and §4.1.3 Step 1b adds 
group-based role loading. The doc itself notes: _"current code only loads 
user-direct roles (ROLE_USER_REL). Loading group roles via [1b] is a NEW 
capability."_ This changes the authorization permission model (users in groups 
now inherit group roles), which has a different risk profile, requires separate 
security testing, and affects audit compliance. Strongly recommend splitting:
   > 
   > * **Phase 0 (new PR):** Add group role inheritance using existing 
TTL-based caching
   > * **Phase 2 (this PR):** Migrate all caches to version-validated mechanism
   > 
   > ### 🟡 Should Fix
   > **#4: Multi-entity version bump scenarios are underdefined** The design 
shows version bumps for single-entity mutations but doesn't address complex 
cross-entity operations: ownership transfer, adding a user to a group with 
existing roles, privilege grant cascades. Does `user_meta` need a version bump 
when the user's group gains a new role? Add a decision table covering all 
mutation scenarios.
   > 
   > **#5: "Lightweight" query claims are unquantified** Step 1a involves a 
JOIN across `metalake_meta` + `user_meta` — that's two index scans, not a 
trivial lookup. The Phase 2 hot path has 4 required DB queries per auth 
request. Please add benchmarking results (or commit to profiling before Phase 2 
implementation) to validate the performance claims.
   > 
   > **#6: No degraded mode strategy for DB unavailability** Phase 2 makes 
every auth request require live DB connectivity, removing the current passive 
resilience (today's entity cache can serve from memory during short DB outages 
within the TTL window). The statement _"DB unavailable → auth blocked (same as 
today)"_ is inaccurate. Please add a §4.2 covering the chosen approach: 
stale-on-error, circuit breaker, or explicit acceptance that DB becomes a 
single point of failure for auth.
   > 
   > **#7: `ownerRel` replacement — implementation details missing** The 
proposal to replace `ownerRel` with a request-scoped cache in 
`AuthorizationRequestContext` is sound, but the design omits: context lifecycle 
(ThreadLocal vs. request bean), cache key structure, and thread-safety for 
parallel auth checks. Please document the concrete 
`AuthorizationRequestContext` changes. Also, the existing `getID()` double-call 
bug (lines 224/228) should be fixed as a **separate commit** rather than 
bundled into Phase 2.
   > 
   > **#8: MyBatis mapper return type is semantically incorrect** `Map<Long, 
Integer> batchGetSecurableObjectsVersions(List<Long> roleIds)` with 
`resultType="map"` won't work as intended — MyBatis returns `List<Map<String, 
Object>>` for that configuration, not a flattened `Map<Long, Integer>`. 
Recommend using a dedicated VO (`record RoleVersion(long roleId, int version) 
{}`) with a proper `<resultMap>`.
   > 
   > ### ✅ Strong Points
   > * Polaris reference analysis (§3.1) is excellent industry research
   > * TOCTOU/race condition analysis (§4.1.6) is thorough
   > * Phase 1/2 split is well-reasoned and de-risks schema migration
   > * `ownerRel` decision log (§7.2) shows strong self-review
   > 
   > Overall the direction is right. The three 🔴 issues — especially scope 
creep on group roles and missing locking in the cache interface — should be 
resolved before moving to implementation.
   
    Please evaluate the AI feedbacks @yuqi1129 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to