[
https://issues.apache.org/jira/browse/CASSANDRA-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071583#comment-18071583
]
Runtian Liu commented on CASSANDRA-21200:
-----------------------------------------
4.1 PR ready: https://issues.apache.org/jira/browse/CASSANDRA-21200
> Replica Slot Grouping: Prevent write availability degradation during topology
> changes
> -------------------------------------------------------------------------------------
>
> Key: CASSANDRA-21200
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21200
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Consistency/Coordination
> Reporter: Runtian Liu
> Assignee: Runtian Liu
> Priority: Normal
>
> Summary
> During topology changes (bootstrap, decommission, move, replacement),
> Cassandra adds pending replicas to the write path and inflates {{blockFor}}
> by {{pending.size()}}. This means writes must wait for acknowledgments
> from both natural and pending replicas before succeeding. If the pending
> replica is slow or unresponsive (common during streaming-heavy transitions),
> writes time out even though the natural quorum is fully available.
> This issue affects both regular writes and Lightweight Transactions / Paxos
> V2.
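A minimal sketch (hypothetical helper, not Cassandra's actual code) of the inflation described above: the base quorum plus one extra required ack per pending replica.

```python
# Hypothetical sketch of the current behavior: blockFor is the base quorum
# inflated by the number of pending replicas. Function name is illustrative.

def block_for(replication_factor: int, pending: list[str]) -> int:
    """QUORUM blockFor, inflated by pending.size()."""
    quorum = replication_factor // 2 + 1
    return quorum + len(pending)

# RF=3, no topology change: 2 acks suffice.
print(block_for(3, []))         # 2
# RF=3 with one bootstrapping node: 3 acks are now required.
print(block_for(3, ["node4"]))  # 3
```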
> Problem
> Consider a 3-node cluster with RF=3 and QUORUM writes ({{blockFor = 2}}).
> When a 4th node begins bootstrapping:
> * The pending replica is added to the write set, inflating {{blockFor}} to 3
> * Writes now require 3 acknowledgments instead of 2
> * If the bootstrapping node is busy streaming and responds slowly, writes
> time out
> * The cluster effectively loses write availability during what should be a
> routine operation
> The same issue occurs during node replacement, decommission, and range
> movement. The fundamental problem is that {{blockFor}} inflation treats the
> pending replica as an independent additional requirement, when logically it
> is _transitioning into_ an existing replica's slot.
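The failure mode can be simulated in a few lines (hypothetical sketch, not Cassandra code): one natural replica is down, the bootstrapping node is busy streaming, and the inflated {{blockFor}} turns a healthy natural quorum into a timeout.

```python
# Simulated write during bootstrap: RF=3, QUORUM, n3 down, pending n4 slow.

def write_succeeds(acks_received: int, block_for: int) -> bool:
    return acks_received >= block_for

natural_acks = 2   # n1 and n2 ack; n3 is down
pending_acks = 0   # the bootstrapping node is busy streaming and never acks

# Without a topology change: blockFor = 2, the natural quorum suffices.
assert write_succeeds(natural_acks, 2)

# During bootstrap: blockFor is inflated to 3, so the same write times out
# even though the natural quorum (2 of 3) is fully available.
assert not write_succeeds(natural_acks + pending_acks, 3)
```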
> Proposed Solution: Replica Slot Grouping
> Instead of inflating {{blockFor}}, group endpoints that are transitioning
> the same logical replica position into a single "slot":
> * A stable slot contains a single endpoint (no topology change in progress).
> Requires 1 ack.
> * A transitioning slot contains two endpoints (e.g., a leaving node and its
> replacement, or a bootstrapping node and the existing owner). Requires acks
> from all members (both old and new) for the slot to count as satisfied.
> {{blockFor}} stays at the base quorum level (no inflation from pending
> replicas). The quorum is counted in terms of slots rather than individual
> endpoints.
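The slot semantics above can be sketched as follows (names are hypothetical, not an actual Cassandra API): a slot is the set of endpoints occupying one logical replica position, and it counts toward quorum only when every member has acknowledged.

```python
# Illustrative slot-based counting: a stable slot holds one endpoint, a
# transitioning slot holds two (old owner + incoming node).

def slot_satisfied(slot: list[str], acked: set[str]) -> bool:
    # Stable slot: one endpoint, one ack.
    # Transitioning slot: both the old and the new endpoint must ack.
    return all(endpoint in acked for endpoint in slot)

def satisfied_slots(slots: list[list[str]], acked: set[str]) -> int:
    return sum(1 for slot in slots if slot_satisfied(slot, acked))

assert slot_satisfied(["n1"], {"n1"})              # stable slot
assert not slot_satisfied(["n3", "n4"], {"n3"})    # new node has not acked
assert slot_satisfied(["n3", "n4"], {"n3", "n4"})  # both halves acked
```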
> Example: In the 3-node + 1 bootstrapping scenario above:
> * 3 slots total: 2 stable (nodes that are unaffected) + 1 transitioning
> (existing owner + bootstrapping node)
> * QUORUM still requires 2 slots
> * If the bootstrapping node is slow, the 2 stable slots still respond: 2
> satisfied slots, meeting quorum
> * If the bootstrapping node responds, the transitioning slot is satisfied,
> also meeting quorum
> * In both cases, writes succeed without timeout
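Both branches of the example can be checked directly (hypothetical sketch; node names are illustrative):

```python
# Worked version of the bootstrap example: 3 slots, blockFor stays at 2,
# n4 is bootstrapping into n3's slot.

def satisfied(slots, acked):
    return sum(1 for slot in slots if all(ep in acked for ep in slot))

slots = [["n1"], ["n2"], ["n3", "n4"]]  # 2 stable + 1 transitioning
block_for = 2                            # QUORUM, no inflation

# Bootstrapping node slow: the two stable slots alone meet quorum.
assert satisfied(slots, {"n1", "n2", "n3"}) >= block_for

# Bootstrapping node responds: the transitioning slot counts as well.
assert satisfied(slots, {"n1", "n3", "n4"}) >= block_for
```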
> Correctness properties:
> * Durability is maintained: when the transitioning slot is satisfied, writes
> reach both old and new replicas
> * Consistency is maintained: quorum intersection properties are preserved
> because slot-based quorum >= natural quorum
> * Availability is improved: writes no longer block on slow/unresponsive
> pending replicas
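The "slot-based quorum >= natural quorum" claim can be brute-forced for the bootstrap example (illustrative check, not part of the patch): every ack set that satisfies 2 slots necessarily contains acks from at least 2 of the 3 natural replicas, so quorum intersection is preserved.

```python
# Exhaustively verify: slot-based quorum implies natural quorum for the
# 3-node + 1 bootstrapping scenario (n4 joining n3's slot).
from itertools import combinations

endpoints = ["n1", "n2", "n3", "n4"]
slots = [["n1"], ["n2"], ["n3", "n4"]]
natural = {"n1", "n2", "n3"}

def satisfied(slots, acked):
    return sum(1 for s in slots if all(e in acked for e in s))

for r in range(len(endpoints) + 1):
    for combo in combinations(endpoints, r):
        acked = set(combo)
        if satisfied(slots, acked) >= 2:      # slot-based quorum met
            assert len(acked & natural) >= 2  # natural quorum also met
```

Each satisfied slot contributes at least one natural ack (a transitioning slot requires the existing owner's ack), so 2 satisfied slots always imply 2 natural acks.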
> Implementation across branches: TCM vs pre-TCM
> In trunk, the Transactional Cluster Metadata (TCM) framework already
> maintains explicit transition state for each replica: the information about
> which endpoints are joining, leaving, or replacing is directly available in
> the cluster metadata. The slot grouping logic can consume this transition
> information directly without any additional calculation.
> In pre-TCM branches (4.x, 5.x), this transition information does not exist as
> a first-class concept. The implementation needs to derive it by analyzing
> {{TokenMetadata}} — computing which pending endpoints map to which existing
> endpoints based on bootstrap tokens, leaving tokens, and the replication
> strategy. This calculation is performed on topology changes and cached per
> keyspace, so the hot write path only performs a map lookup.
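A sketch of that pre-TCM caching scheme (class and method names are hypothetical, not Cassandra's {{TokenMetadata}} API): the pending-to-existing pairing is computed once per topology change and the write path only performs a map lookup.

```python
# Hypothetical per-keyspace cache of pending -> existing endpoint pairings,
# recomputed on topology changes so the hot write path stays a dict lookup.

class SlotGroupingCache:
    def __init__(self):
        self._by_keyspace: dict[str, dict[str, str]] = {}

    def on_topology_change(self, keyspace: str, pairings: dict[str, str]):
        # pairings: {pending_endpoint: existing_endpoint}, derived by
        # comparing bootstrap/leaving tokens against the replication strategy.
        self._by_keyspace[keyspace] = dict(pairings)

    def slot_partner(self, keyspace: str, pending_endpoint: str):
        # Hot write path: a single map lookup, no recomputation.
        return self._by_keyspace.get(keyspace, {}).get(pending_endpoint)

cache = SlotGroupingCache()
cache.on_topology_change("ks1", {"node4": "node3"})
print(cache.slot_partner("ks1", "node4"))  # node3
```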
> Feature flag:
> * {{replica_slot_grouping_enabled}} (cassandra.yaml, runtime-toggleable via
> JMX)
> * Default: disabled (preserves existing behavior)
> * When disabled: zero code path changes, no performance impact
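Assuming the flag ships under the name proposed above, enabling it would look like this in cassandra.yaml (sketch; the key name comes from this proposal and is not yet in any release):

```yaml
# cassandra.yaml -- off by default, preserving existing blockFor behavior
replica_slot_grouping_enabled: true
```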
> Compatibility:
> * Purely additive; no wire protocol or schema changes
> * Feature flag ensures safe rollout and rollback
> * Can be backported to 4.x and 5.x
--
This message was sent by Atlassian Jira
(v8.20.10#820010)