Hello,

Not so long ago, through no doing of our own, we started observing an increased number of dropped read messages, and we can't find an explanation for it. Perhaps you'll have some ideas about where to look to decipher this.

C* cluster: 7 DCs (5 of them with 18 nodes and 2 with 60)
C* version: 3.11.4

Problem: We started seeing dropped read messages, as reported by the DroppedMessage/READ mbean. Peaking at 0.5% of total reads per location, the rate is low enough that we do not 'feel' the effect, but it sparks some worry. The debug.log shows selects on the primary timing out. Interestingly, this behavior started in one location, and then over the span of a few weeks all locations started displaying the same symptoms. Note that the load has not changed. Also, I think it is worth mentioning that a very small amount of hints (kilobytes of data) is constantly circulating. Whatever this issue is, it could be present in the write path as well.

What we have checked and tried so far:

* We are not hitting any hardware resource constraints.
* We've looked into whether this is related to some specific data, but it is not.
* We tried tracing a bunch of queries, but the tracing messages do not implicate anything interesting, and if the read request times out, all tracing messages seem to be gone as well.
* We tried rebooting all DCs.
* We tried increasing the read timeout in the Cassandra settings from 500 to 1000.
* We tried increasing cassandra.max_queued_native_transport_requests to 512.

Nothing seems to give...

What do you guys think? Where would you look for answers?

Gediminas
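
P.S. For anyone wanting to compare numbers, here is a minimal sketch of how a dropped-read percentage like the 0.5% figure above can be derived from two samples of the counters. The function name and the sample values are hypothetical (they are not taken from our cluster), and the sketch assumes that both the dropped READ count and the total read count are monotonically increasing counters sampled at the same two points in time.

```python
def dropped_read_percent(dropped_before, dropped_after, reads_before, reads_after):
    """Return the percentage of reads dropped between two counter samples."""
    dropped = dropped_after - dropped_before
    reads = reads_after - reads_before
    if dropped < 0 or reads < 0:
        # A counter went backwards, e.g. because the node restarted
        # between the two samples, so the delta is meaningless.
        raise ValueError("counter reset between samples")
    if reads == 0:
        # No reads in the interval, so nothing could have been dropped.
        return 0.0
    return 100.0 * dropped / reads


# Hypothetical sample values: 50 drops out of 10,000 reads in the interval.
print(dropped_read_percent(1_000, 1_050, 2_000_000, 2_010_000))
```

Working from deltas rather than raw totals keeps the figure tied to a specific time window, which matters when the peak only shows up at certain times.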