I have seen this same behavior in the past as well and came to the same 
conclusion about where the issue is.  It would be good to write this up in a 
ticket.  Giving people the option of using the DynamicEndpointSnitch to order 
batchlog replica selection could mitigate this exact issue, but it may have 
other tradeoffs for the batchlog's guarantees.
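
A rough sketch of that idea, purely illustrative and not Cassandra's actual 
API: rank the live local-DC candidates by a per-endpoint latency score of the 
kind the DynamicEndpointSnitch already maintains, instead of shuffling them, 
so a node that is up in gossip but very slow is picked last rather than with 
uniform probability.  All names below are placeholders.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class LatencyAwareBatchlogSelection
{
    // latencyScores is a hypothetical stand-in for the snitch's scores:
    // lower means faster; endpoints with no score sort as neutral (0.0).
    static List<String> pickBatchlogEndpoints(List<String> liveLocalDcEndpoints,
                                              Map<String, Double> latencyScores)
    {
        List<String> candidates = new ArrayList<>(liveLocalDcEndpoints);
        candidates.sort(Comparator.comparingDouble(
                (String e) -> latencyScores.getOrDefault(e, 0.0)));
        // Keep the first two, since the batchlog wants two copies.
        return candidates.subList(0, Math.min(2, candidates.size()));
    }
}

The obvious cost is that batchlog copies would then concentrate on whichever 
nodes currently score fastest, which is the kind of tradeoff to the batchlog's 
guarantees mentioned above.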

> On Dec 14, 2022, at 11:19 AM, Sarisky, Dan <dsari...@gmail.com> wrote:
> 
> We issue writes to Cassandra as logged batches (RF=3, consistency levels TWO, 
> QUORUM, or LOCAL_QUORUM).
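
(For what it's worth, the write pattern described here looks roughly like the 
following, assuming the DataStax Java driver 4.x; the keyspace, table, and 
column names are made up for illustration.)

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class LoggedBatchExample
{
    public static void main(String[] args)
    {
        // Connects using the driver's default contact point configuration.
        try (CqlSession session = CqlSession.builder().build())
        {
            // One logged batch, executed at QUORUM, against a made-up ks.events table.
            BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
                .addStatement(SimpleStatement.newInstance(
                    "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 1, "a"))
                .addStatement(SimpleStatement.newInstance(
                    "INSERT INTO ks.events (id, payload) VALUES (?, ?)", 2, "b"))
                .build()
                .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);
            session.execute(batch);
        }
    }
}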
> 
> On clusters of any size, a single extremely slow node causes a ~90% loss of 
> cluster-wide throughput for batched writes.  We can reproduce this in the lab 
> via CPU or disk throttling, and we observe it in 3.11, 4.0, and 4.1.
> 
> It appears the mechanism in play is:
> Those logged batches are immediately written to two replica nodes, and the 
> actual mutations aren't processed until those two nodes acknowledge the batch 
> statements.  Those replica nodes are selected randomly from all nodes in the 
> local data center that are currently up in gossip.  If a single node is slow 
> but still considered up in gossip, this eventually causes every other node to 
> have all of its MutationStage threads waiting while the slow replica accepts 
> batch writes.
> 
> The code in play appears to be 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245.
> In the method filterBatchlogEndpoints() there is a Collections.shuffle() call 
> to order the endpoints and a FailureDetector.isEndpointAlive() check to test 
> whether an endpoint is acceptable.
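
Paraphrasing the selection logic described above as a self-contained sketch 
(placeholder names, not the actual ReplicaPlans code): filter to the endpoints 
the failure detector considers alive, shuffle, and take the first two.  Nothing 
in that path accounts for how slow an endpoint currently is, only whether 
gossip considers it up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class BatchlogEndpointSelectionSketch
{
    // isAlive stands in for the FailureDetector check; a node can pass it
    // while being extremely slow, so it stays in the candidate pool and is
    // chosen with the same probability as a healthy node.
    static List<String> pickBatchlogEndpoints(List<String> localDcEndpoints,
                                              Predicate<String> isAlive)
    {
        List<String> candidates = new ArrayList<>();
        for (String endpoint : localDcEndpoints)
            if (isAlive.test(endpoint))
                candidates.add(endpoint);

        Collections.shuffle(candidates);   // random order, no latency awareness
        return candidates.subList(0, Math.min(2, candidates.size()));
    }
}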
> 
> This behavior turns Cassandra from a multi-node fault-tolerant system into a 
> collection of single points of failure.
> 
> We try to take administrative action to kill off the extremely slow nodes, 
> but it would be great to have some notion of "which node is a bad choice" 
> when selecting the replica nodes that receive batchlog writes.
> 
