yuqi1129 commented on PR #10696:
URL: https://github.com/apache/gravitino/pull/10696#issuecomment-4250082275
> last_version
> > ## Design Review Feedback
> > Thanks for the detailed design document! The version-validated cache
approach is architecturally sound and the Polaris-inspired analysis is
excellent. Here's design-level feedback before implementation begins:
> > ### ๐ด Must Fix
> > **#1: Version column duplication violates existing schema patterns** The
proposed `role_meta.securable_objects_version`,
`user_meta.role_grants_version`, etc. duplicate functionality already provided
by the existing `current_version` / `last_version` columns present on **all 20+
metadata tables**. Adding a third versioning scheme creates confusion about
which column to increment on a given mutation and creates migration complexity.
Recommend either reusing `current_version` (increment on any mutation) or
following the `table_column_version_info` / `function_version_info` pattern to
add a dedicated version event table.
> > **#2: `GravitinoCache` interface is missing lock primitives โ race
conditions** The proposed `GravitinoCache<K,V>` interface has no locking
methods. The existing `EntityCache` interface defines three critical lock
primitives (`withCacheLock`, `withMultipleKeyCacheLock`) backed by
`SegmentedLock`. The multi-step read-check-update in the auth flow (ยง4.1.3 Step
1) has a classic race condition without them:
> > ```
> > Thread A: reads version=5, cache has version=4 โ starts reload
> > Thread B: reads version=5, cache has version=4 โ starts reload
(duplicate query, competing write)
> > ```
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > `GravitinoCache` must expose the same locking primitives as
`EntityCache`.
> > **#3: Group role loading is a new feature, not a caching optimization โ
scope creep** ยง4.1.2 introduces `groupRoleCache` and ยง4.1.3 Step 1b adds
group-based role loading. The doc itself notes: _"current code only loads
user-direct roles (ROLE_USER_REL). Loading group roles via [1b] is a NEW
capability."_ This changes the authorization permission model (users in groups
now inherit group roles), which has a different risk profile, requires separate
security testing, and affects audit compliance. Strongly recommend splitting:
> >
> > * **Phase 0 (new PR):** Add group role inheritance using existing
TTL-based caching
> > * **Phase 2 (this PR):** Migrate all caches to version-validated
mechanism
> >
> > ### ๐ก Should Fix
> > **#4: Multi-entity version bump scenarios are underdefined** The design
shows version bumps for single-entity mutations but doesn't address complex
cross-entity operations: ownership transfer, adding a user to a group with
existing roles, privilege grant cascades. Does `user_meta` need a version bump
when the user's group gains a new role? Add a decision table covering all
mutation scenarios.
> > **#5: "Lightweight" query claims are unquantified** Step 1a involves a
JOIN across `metalake_meta` + `user_meta` โ that's two index scans, not a
trivial lookup. The Phase 2 hot path has 4 required DB queries per auth
request. Please add benchmarking results (or commit to profiling before Phase 2
implementation) to validate the performance claims.
> > **#6: No degraded mode strategy for DB unavailability** Phase 2 makes
every auth request require live DB connectivity, removing the current passive
resilience (today's entity cache can serve from memory during short DB outages
within the TTL window). The statement _"DB unavailable โ auth blocked (same as
today)"_ is inaccurate. Please add a ยง4.2 covering the chosen approach:
stale-on-error, circuit breaker, or explicit acceptance that DB becomes a
single point of failure for auth.
> > **#7: `ownerRel` replacement โ implementation details missing** The
proposal to replace `ownerRel` with a request-scoped cache in
`AuthorizationRequestContext` is sound, but the design omits: context lifecycle
(ThreadLocal vs. request bean), cache key structure, and thread-safety for
parallel auth checks. Please document the concrete
`AuthorizationRequestContext` changes. Also, the existing `getID()` double-call
bug (lines 224/228) should be fixed as a **separate commit** rather than
bundled into Phase 2.
> > **#8: MyBatis mapper return type is semantically incorrect** `Map<Long,
Integer> batchGetSecurableObjectsVersions(List<Long> roleIds)` with
`resultType="map"` won't work as intended โ MyBatis returns `List<Map<String,
Object>>` for that configuration, not a flattened `Map<Long, Integer>`.
Recommend using a dedicated VO (`record RoleVersion(long roleId, int version)
{}`) with a proper `<resultMap>`.
> > ### โ
Strong Points
> >
> > * Polaris reference analysis (ยง3.1) is excellent industry research
> > * TOCTOU/race condition analysis (ยง4.1.6) is thorough
> > * Phase 1/2 split is well-reasoned and de-risks schema migration
> > * `ownerRel` decision log (ยง7.2) shows strong self-review
> >
> > Overall the direction is right. The three ๐ด issues โ especially scope
creep on group roles and missing locking in the cache interface โ should be
resolved before moving to implementation.
>
> Please evaluate the AI feedbacks @yuqi1129 .
> Version column duplication violates existing schema patterns
For the fields `current_version` or `last_version` in most tables like
`metalake_meta`, `catalog_meta`, and so on are always 1(default value) and
can't satisfy our meet.
Of course, we can fix it, but we need to manage at least 8(metalake,
catalog, schema, fileset, table, model, view, topic) types of versions in the
memory, it's very tedious.
> 2. GravitinoCache interface is missing lock primitives โ race conditions
`GravitinoCache` is just an interface, and a detailed implementation will
add the necessary locks to avoid race conditions.
> 3. Group role loading is a new feature, not a caching optimization
This is a complement to the document and for future use only. It will not be
implemented in the first stage.
> Multi-entity version bump scenarios are underdefined
I have covered the changes.
For 5, 6, 7. As we have adopted an eventually-consistent cache, there should
not be a problem.
For 8, I have fixed it to use POJO NOT a map.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]