Hello,

Not long ago, without any changes on our side, we started observing an increased
number of dropped read messages, and we can't find an explanation for it.
Perhaps you'll have some ideas about where to look to try to decipher this.

C* cluster: 7 DCs (5 of them with 18 nodes and 2 with 60)
C* version: 3.11.4
Problem: We started seeing dropped read messages, as reported by the
DroppedMessage/READ MBean. At its peak this is 0.5% of total reads per location,
which is low enough that we don't 'feel' the effect, but it is still worrying.
The debug.log shows selects on the primary timing out.
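For reference, a percentage like the one above can be derived from two snapshots
of cumulative counters taken some interval apart. The sketch below assumes the
numerator is the Count of the DroppedMessage/READ meter and the denominator is
some cumulative read counter (for example a client read request count). Which
denominator best matches "total reads" is an assumption, and the snapshot values
are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Cumulative counters read at one point in time (hypothetical values)."""
    dropped_reads: int  # assumed: Count of the DroppedMessage/READ meter
    total_reads: int    # assumed: some cumulative read request counter


def dropped_read_percentage(before: Snapshot, after: Snapshot) -> float:
    """Percentage of reads dropped between two snapshots of cumulative counters."""
    dropped = after.dropped_reads - before.dropped_reads
    total = after.total_reads - before.total_reads
    if total <= 0:
        # No reads in the interval, or the counters went backwards (e.g. a restart).
        return 0.0
    return 100.0 * dropped / total


if __name__ == "__main__":
    # Made-up numbers: 50 drops out of 10,000 reads in the interval.
    t0 = Snapshot(dropped_reads=1_000, total_reads=2_000_000)
    t1 = Snapshot(dropped_reads=1_050, total_reads=2_010_000)
    print(f"{dropped_read_percentage(t0, t1):.2f}%")  # prints "0.50%"
```

Using deltas rather than the raw cumulative counts matters because these counters
start from zero when a node starts, so lifetime totals would dilute a recent
increase.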

Interestingly, this behavior started in one location, and within a few weeks all
locations were showing the same symptoms. Note that the load has not changed. It
is also worth mentioning that a very small amount of hints (kilobytes of data)
is constantly circulating, so whatever the issue is, it could be present in the
write path as well.


  *   We are not hitting any hardware resource constraints.
  *   We've looked into whether this is related to some specific data, but it is
not.
  *   We tried tracing a bunch of queries, but the tracing messages don't point to
anything interesting, and if a read request times out, all of its tracing
messages seem to be gone as well.
  *   We tried rebooting all DCs.
  *   We increased the read timeout in the Cassandra settings from 500 to 1000.
  *   We increased cassandra.max_queued_native_transport_requests to 512.

Nothing seems to help...
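One way to reason about why a larger read timeout may not change much: in a
simplified model (an assumption about Cassandra's behaviour, not verified against
the 3.11.4 source), a replica drops a read message if, by the time a worker
thread picks it up, the message has already waited longer than the read timeout.
The sketch below counts drops for a list of hypothetical queue-wait times under
two timeouts; the function name and the sample waits are made up for
illustration.

```python
def count_dropped(queue_waits_ms: list[float], timeout_ms: float) -> int:
    """Count messages whose wait before processing exceeded the timeout.

    Simplified model (assumption): a message is dropped instead of processed
    if it has already waited longer than timeout_ms when a worker picks it up.
    """
    return sum(1 for wait in queue_waits_ms if wait > timeout_ms)


if __name__ == "__main__":
    # Hypothetical waits in ms: most reads are fast, a few stall for seconds.
    waits = [2, 3, 5, 4, 800, 3, 2500, 6, 4, 3000]
    for timeout in (500, 1000):
        print(timeout, count_dropped(waits, timeout))
    # prints "500 3" then "1000 2"
```

In this model a larger timeout only rescues messages whose wait falls between the
old and the new value; waits far beyond both are dropped either way.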

What do you guys think? Where would you look for answers?

Gediminas
