Hi all,

I just opened a JIRA that is relevant to anyone running large clusters (around the 400-node range) with plans to upgrade to 4.0 soon.
https://issues.apache.org/jira/browse/CASSANDRA-16877

The issue is that in large clusters, the size of gossip messages sent when a node (re)starts may exceed the hard limit of the urgent message channel. This causes an error on the sender, and the message is ultimately dropped. This in turn can cause startup failures and/or partial loss of availability.

Fortunately, the fix is quite simple. I’ve submitted a patch that I and other contributors have been running since discovering this issue, and we can confirm it resolves the problem. It would be great to get it reviewed and merged ASAP and then cut a 4.0.1 release. In the meantime, it may be wise to suggest that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam