[ 
https://issues.apache.org/jira/browse/CASSANDRA-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071583#comment-18071583
 ] 

Runtian Liu commented on CASSANDRA-21200:
-----------------------------------------

4.1 PR ready: https://issues.apache.org/jira/browse/CASSANDRA-21200

> Replica Slot Grouping: Prevent write availability degradation during topology 
> changes
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21200
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21200
>             Project: Apache Cassandra
>          Issue Type: New Feature
>          Components: Consistency/Coordination
>            Reporter: Runtian Liu
>            Assignee: Runtian Liu
>            Priority: Normal
>
> Summary
> During topology changes (bootstrap, decommission, move, replacement), 
> Cassandra adds pending replicas to the write path and inflates {{blockFor}} 
> by {{pending.size()}}. This means writes must wait for acknowledgments 
> from both natural and pending replicas before succeeding. If the pending 
> replica is slow or unresponsive (common during streaming-heavy transitions), 
> writes time out even though the natural quorum is fully available.
> This issue affects both regular writes and Lightweight Transactions / Paxos 
> V2.
> Problem
> Consider a 3-node cluster with RF=3 and QUORUM writes ({{blockFor = 2}}). 
> When a 4th node begins bootstrapping:
>  * The pending replica is added to the write set, inflating {{blockFor}} to 3
>  * Writes now require 3 acknowledgments instead of 2
>  * If the bootstrapping node is busy streaming and responds slowly, writes 
> time out
>  * The cluster effectively loses write availability during what should be a 
> routine operation
> The same issue occurs during node replacement, decommission, and range 
> movement. The fundamental problem is that {{blockFor}} inflation treats the 
> pending replica as an independent additional requirement, when logically it 
> is _transitioning into_ an existing replica's slot.
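To make the failure mode above concrete, here is a minimal sketch of the inflated ack requirement (class and method names are hypothetical for illustration, not Cassandra's actual code):

```java
// Hypothetical sketch: with pending-replica inflation, a QUORUM write
// must collect quorum + pending acknowledgments before it succeeds.
public class BlockForInflation {
    static int quorumFor(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Current behavior: blockFor is inflated by the number of pending replicas.
    static int blockFor(int replicationFactor, int pendingReplicas) {
        return quorumFor(replicationFactor) + pendingReplicas;
    }

    public static void main(String[] args) {
        // 3-node cluster, RF=3, no topology change: 2 acks needed.
        System.out.println(blockFor(3, 0)); // 2
        // Same cluster while a 4th node bootstraps: 3 acks needed, so one
        // slow pending replica can time out every quorum write.
        System.out.println(blockFor(3, 1)); // 3
    }
}
```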
> Proposed Solution: Replica Slot Grouping
> Instead of inflating {{blockFor}}, group endpoints that are transitioning 
> through the same logical replica position into a single "slot":
>  * A stable slot contains a single endpoint (no topology change in progress). 
> Requires 1 ack.
>  * A transitioning slot contains two endpoints (e.g., a leaving node and its 
> replacement, or a bootstrapping node and the existing owner). Requires acks 
> from all members (both old and new) for the slot to count as satisfied.
> {{blockFor}} stays at the base quorum level (no inflation from pending 
> replicas). The quorum is counted in terms of slots rather than individual 
> endpoints.
> Example: In the 3-node + 1 bootstrapping scenario above:
>  * 3 slots total: 2 stable (nodes that are unaffected) + 1 transitioning 
> (existing owner + bootstrapping node)
>  * QUORUM still requires 2 slots
>  * If the bootstrapping node is slow, only 2 stable slots respond — that's 2 
> slots, meeting quorum
>  * If the bootstrapping node responds, the transitioning slot is satisfied, 
> also meeting quorum
>  * In both cases, writes succeed without timeout
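The example above can be sketched as slot-based ack counting (hypothetical types for illustration, not the actual patch): a slot is satisfied only when all of its endpoints have acknowledged, and the write succeeds once the quorum number of slots is satisfied.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of slot-based quorum counting: quorum is measured in slots, not
// endpoints; a transitioning slot needs acks from both of its members.
public class SlotQuorum {
    final List<Set<String>> slots;   // each set = the endpoints in one slot
    final int quorum;                // slots required, not endpoints
    final Set<String> acked = new HashSet<>();

    SlotQuorum(List<Set<String>> slots, int quorum) {
        this.slots = slots;
        this.quorum = quorum;
    }

    void onAck(String endpoint) { acked.add(endpoint); }

    boolean satisfied() {
        long satisfiedSlots = slots.stream()
                                   .filter(acked::containsAll)
                                   .count();
        return satisfiedSlots >= quorum;
    }

    public static void main(String[] args) {
        // RF=3 with node D bootstrapping into C's range: two stable slots
        // {A}, {B}, one transitioning slot {C, D}; QUORUM still needs 2 slots.
        SlotQuorum write = new SlotQuorum(
                List.of(Set.of("A"), Set.of("B"), Set.of("C", "D")), 2);
        write.onAck("A");
        write.onAck("C");                      // transitioning slot still needs D
        System.out.println(write.satisfied()); // false: only 1 slot satisfied
        write.onAck("B");
        System.out.println(write.satisfied()); // true: 2 stable slots satisfied
    }
}
```

Note that the slow bootstrapping node D never has to answer for the write to succeed, yet whenever the transitioning slot does count, both C and D have the write.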
> Correctness properties:
>  * Durability is maintained: when the transitioning slot is satisfied, writes 
> reach both old and new replicas
>  * Consistency is maintained: quorum intersection properties are preserved 
> because slot-based quorum >= natural quorum
>  * Availability is improved: writes no longer block on slow/unresponsive 
> pending replicas
> Implementation across branches: TCM vs pre-TCM
> In trunk (with TCM), the Transactional Cluster Metadata framework already 
> maintains explicit transition state for each replica — the information about 
> which endpoints are joining, leaving, or replacing is directly available in 
> the cluster metadata. The slot grouping logic can consume this transition 
> information directly without any additional calculation.
> In pre-TCM branches (4.x, 5.x), this transition information does not exist as 
> a first-class concept. The implementation needs to derive it by analyzing 
> {{TokenMetadata}} — computing which pending endpoints map to which existing 
> endpoints based on bootstrap tokens, leaving tokens, and the replication 
> strategy. This calculation is performed on topology changes and cached per 
> keyspace, so the hot write path only performs a map lookup.
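A minimal sketch of this pre-TCM grouping step might look as follows (all names are hypothetical; the real patch would derive the pending-to-owner pairing from {{TokenMetadata}} and the replication strategy, recompute it on topology change, and cache it per keyspace):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: given the natural replicas for a range and a mapping
// from each pending endpoint to the natural replica whose position it is
// taking over, build the slot groups consumed by the write path.
public class SlotGrouping {
    static Map<String, Set<String>> groupSlots(List<String> natural,
                                               Map<String, String> pendingToOwner) {
        Map<String, Set<String>> slots = new LinkedHashMap<>();
        for (String replica : natural)
            slots.put(replica, new HashSet<>(Set.of(replica)));
        // Each pending endpoint joins its owner's slot, turning a stable
        // slot into a transitioning one. Assumes owner is a natural replica.
        pendingToOwner.forEach((pending, owner) -> slots.get(owner).add(pending));
        return slots;
    }

    public static void main(String[] args) {
        // D bootstraps and will take over C's position for this range:
        // three slots total, {A}, {B} stable and {C, D} transitioning.
        Map<String, Set<String>> slots =
                groupSlots(List.of("A", "B", "C"), Map.of("D", "C"));
        System.out.println(slots);
    }
}
```

Because the map is rebuilt only on topology change, the hot write path reduces to a cached lookup, as the ticket describes.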
> Feature flag:
>  * {{replica_slot_grouping_enabled}} (cassandra.yaml, runtime-toggleable via 
> JMX)
>  * Default: disabled (preserves existing behavior)
>  * When disabled: zero code path changes, no performance impact
> Compatibility:
>  * Purely additive; no wire protocol or schema changes
>  * Feature flag ensures safe rollout and rollback
>  * Can be backported to 4.x and 5.x



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
