CASSANDRA-18120 created.

On 12/14/2022 3:13 PM, Jeremiah Jordan wrote:
I have seen this same behavior in the past as well and came to the same 
conclusion about where the issue is.  It would be good to write this up in a 
ticket.  Giving people the option of using the DynamicEndpointSnitch to order 
batch log replica selection could mitigate this exact issue, but it may have 
other tradeoffs with respect to the batch log's guarantees.
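
For illustration only, a score-ordered selection along those lines might look 
something like the sketch below. The class, method, and score map are 
hypothetical stand-ins, not Cassandra internals; the real DynamicEndpointSnitch 
tracks per-endpoint latency scores that could play the same role.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: order the alive batchlog candidates by a per-endpoint
    // latency score (lower is better) instead of shuffling them, so an extremely
    // slow but still-alive node sorts to the back and is rarely chosen.
    public final class ScoreOrderedBatchlogCandidates
    {
        static List<String> orderByScore(List<String> aliveEndpoints,
                                         Map<String, Double> latencyScores)
        {
            List<String> ordered = new ArrayList<>(aliveEndpoints);
            // Endpoints with no recorded score sort as neutral (0.0).
            ordered.sort(Comparator.comparingDouble(
                    (String e) -> latencyScores.getOrDefault(e, 0.0)));
            return ordered;
        }
    }

Any such ordering would presumably still need to respect the existing 
preference for placing the two batchlog copies in distinct racks, which is 
likely part of the tradeoff mentioned above.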

On Dec 14, 2022, at 11:19 AM, Sarisky, Dan <dsari...@gmail.com> wrote:

We issue writes to Cassandra as logged batches (RF=3, consistency levels TWO, 
QUORUM, or LOCAL_QUORUM).
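
For reference, the write pattern in question looks roughly like this with the 
DataStax Java driver 4.x; the keyspace, tables, and columns below are made up 
for the example:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.BatchStatement;
    import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public final class LoggedBatchExample
    {
        public static void main(String[] args)
        {
            try (CqlSession session = CqlSession.builder().build())
            {
                // LOGGED batch: the coordinator persists the batch to the
                // batchlog on two other nodes before applying the mutations.
                BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events (id, seq, payload) VALUES (?, ?, ?)", "a", 1, "x"))
                    .addStatement(SimpleStatement.newInstance(
                        "INSERT INTO ks.events_by_seq (seq, id, payload) VALUES (?, ?, ?)", 1, "a", "x"))
                    .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM)
                    .build();
                session.execute(batch);
            }
        }
    }

The consistency level applies to the mutations themselves; the batchlog copies 
go to two nodes chosen by the coordinator, which is the selection discussed 
below.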

On clusters of any size, a single extremely slow node causes a ~90% loss of 
cluster-wide throughput for batched writes.  We can reproduce this in the lab 
via CPU or disk throttling.  I observe this in 3.11, 4.0, and 4.1.

It appears the mechanism in play is:
Those logged batches are immediately written to two replica nodes, and the 
actual mutations aren't processed until those two nodes acknowledge the batch 
statements.  Those replica nodes are selected randomly from all nodes in the 
local data center currently up in gossip.  If a single node is slow, but still 
thought to be up in gossip, this eventually causes every other node to have all 
of its MutationStage threads waiting while the slow replica accepts batch writes.

The code in play appears to be filterBatchlogEndpoints() in 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L245.
It uses a Collections.shuffle() to order the endpoints and 
FailureDetector.isEndpointAlive() to test whether an endpoint is acceptable.
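
As a simplified stand-in for that logic (not a copy of the Cassandra source), 
the selection amounts to something like:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Predicate;

    // Simplified stand-in for the selection described above: shuffle the
    // local-DC endpoints, then keep the first two that the failure detector
    // considers alive.
    public final class BatchlogEndpointSelection
    {
        static List<String> pickTwo(List<String> localDcEndpoints, Predicate<String> isAlive)
        {
            List<String> shuffled = new ArrayList<>(localDcEndpoints);
            Collections.shuffle(shuffled);          // random order, no latency awareness
            List<String> chosen = new ArrayList<>(2);
            for (String endpoint : shuffled)
            {
                if (isAlive.test(endpoint))         // liveness only, not responsiveness
                    chosen.add(endpoint);
                if (chosen.size() == 2)
                    break;
            }
            return chosen;
        }
    }

Because the filter only asks whether an endpoint is alive, not how it is 
performing, a slow-but-alive node keeps being selected and coordinators 
eventually back up behind it.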

This behavior turns Cassandra from a multi-node fault-tolerant system into a 
collection of single points of failure.

We try to take administrative action to kill off the extremely slow nodes, but 
it would be great to have some notion of which node is a bad choice when 
writing logged batches to replica nodes.

