Hi all,

Just bumping this thread in case it was missed the first time.
I’ve updated *CASSANDRA-20993 <https://issues.apache.org/jira/browse/CASSANDRA-20993>* with a detailed *Correctness / Safety* section that explains why excluding the pending replacement node from blockFor during node replacement does not weaken read-after-write guarantees for any combination of write CL and read CL.

The key point is that the effective number of *natural replicas* that must acknowledge a write (and be consulted for a read) is unchanged; we only stop inflating blockFor with the pending replacement. For example, in the common *RF=3, QUORUM write + QUORUM read* case, the proof shows that during a C → D replacement:

- Every successful QUORUM write is still guaranteed to be stored on a quorum of naturals (e.g., A and B), and
- Every QUORUM read, both before and after the replacement completes, must intersect {A, B}, so it always sees the latest value.

The more general argument in the ticket covers all CL pairs and shows that the standard condition W_eff + R_eff > RF holds (or not) exactly as before; the change only removes unnecessary write timeouts when the pending replacement is slow.

If you have concerns about the correctness argument, or think there are corner cases I’m missing (e.g., particular CL combinations or topology transitions), I’d really appreciate feedback on the JIRA or in this thread.

Thanks,
Runtian

On Tue, Nov 25, 2025 at 4:44 PM Runtian Liu <[email protected]> wrote:

> Hi everyone,
>
> I’d like to start a discussion about adjusting how Cassandra calculates blockFor during node replacements. The JIRA tracking this proposal is here:
> https://issues.apache.org/jira/browse/CASSANDRA-20993
>
> Problem Background
>
> Today, during a replacement, the pending replica is always included when determining the required acknowledgments. For example, with RF=3 and LOCAL_QUORUM, the coordinator waits for three responses instead of two.
> Since replacement nodes are often bootstrapping and slow to respond, this can result in write timeouts or increased write latency, even though the client only requested acknowledgments from the natural replicas.
>
> This behavior effectively breaks the client contract by requiring more responses than the specified consistency level.
>
> Proposed Change
>
> For replacement scenarios only, exclude pending replicas from blockFor and require acknowledgments solely from natural replicas. Pending nodes will still receive writes, but their responses will not count toward satisfying the consistency level.
>
> Responses from the node being replaced would also be ignored. Although it is uncommon for a replaced node to become reachable again, adding this safeguard avoids ambiguity and ensures correctness if that situation occurs.
>
> This change would be disabled by default and controlled via a feature flag to avoid affecting existing deployments.
>
> In my view, this behavior is effectively a bug because the coordinator waits for more acknowledgments than the client requested, leading to avoidable failures or latency. Since the issue affects correctness from the client perspective rather than introducing new semantics, it would be valuable to include this fix in the 4.x branches as well, with the behavior disabled by default where needed.
>
> Motivation
>
> This change:
>
> - Prevents unnecessary write timeouts during replacements
> - Reduces write latency by eliminating dependence on a busy pending replica
> - Aligns server behavior with client expectations
>
> Current Status
>
> A PR for 4.1 is available here for review:
> https://github.com/apache/cassandra/pull/4494
>
> Feedback is welcome on both the implementation and the approach.
>
> Next Steps
>
> I’d appreciate input on:
>
> 1. Any correctness concerns for replacement scenarios
> 2. Whether a feature-flagged approach is acceptable
>
> Thanks in advance for your feedback,
> Runtian
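P.S. To make the RF=3, QUORUM + QUORUM case from the correctness section concrete, here is a small standalone sketch. This is plain Python, not Cassandra code; the helper names (`blockfor_current`, `blockfor_proposed`) and the node labels A/B/C/D are illustrative only, and it assumes the replaced node C is unreachable for the duration of the replacement:

```python
# Illustrative model of the blockFor change during a C -> D replacement.
# Not Cassandra code: names and helpers here are invented for the example.
from itertools import combinations

RF = 3
NATURALS = {"A", "B", "C"}  # natural replicas before the replacement

def quorum(rf: int) -> int:
    """Quorum size for a replication factor: floor(rf/2) + 1."""
    return rf // 2 + 1

def blockfor_current(cl_acks: int, pending: int) -> int:
    """Today's behavior: pending replicas inflate blockFor."""
    return cl_acks + pending

def blockfor_proposed(cl_acks: int, pending: int) -> int:
    """Proposed behavior: pending replicas still receive writes,
    but their acks no longer count toward the consistency level."""
    return cl_acks

# With RF=3 and QUORUM during a replacement, today's coordinator waits
# for 3 acks (2 naturals + 1 pending); the proposal waits for 2.
assert blockfor_current(quorum(RF), pending=1) == 3
assert blockfor_proposed(quorum(RF), pending=1) == 2

# Read-after-write before the replacement completes: any write quorum of
# naturals intersects any read quorum of naturals because W + R > RF.
W = R = quorum(RF)
for write_set in combinations(sorted(NATURALS), W):
    for read_set in combinations(sorted(NATURALS), R):
        assert set(write_set) & set(read_set), "quorums must intersect"

# After the replacement completes, the naturals are {A, B, D}. With C
# unreachable during the replacement, a successful QUORUM write was
# stored on {A, B}, and every post-replacement QUORUM read set
# intersects {A, B}, so the read still sees the latest value.
POST_NATURALS = {"A", "B", "D"}
for read_set in combinations(sorted(POST_NATURALS), R):
    assert set(read_set) & {"A", "B"}, "read must see the write"

print("all quorum-intersection checks passed")
```

The sketch only covers the RF=3 symmetric-quorum case from the example; the general argument for arbitrary CL pairs is the W_eff + R_eff > RF condition spelled out in the ticket.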
