Thank you for raising this, Sam! Agreed this is a bug that warrants releasing 4.0.1 and notifying user@.
To elaborate on impact: during a rolling 3.x -> 4.0 upgrade, this issue can leave the cluster in a state where 4.0 nodes fail to serialize gossip state during the shadow round once that state exceeds 128 KB. Once the cluster is in this state, new instances cannot start up and join the ring. If existing 4.0 instances restart, they will also be unable to gossip and will remain down.

It's a pretty serious situation without an obvious way out aside from deploying this patch. We should get a new release out quickly.

– Scott

________________________________________
From: Sam Tunnicliffe <s...@beobal.com>
Sent: Monday, August 23, 2021 11:27 AM
To: dev@cassandra.apache.org
Subject: Potential issues during 4.0 upgrade

Hi all,

I just opened a JIRA which is relevant to those running large clusters (around the 400 node range) who plan to upgrade to 4.0 soon.

https://issues.apache.org/jira/browse/CASSANDRA-16877

The issue is that in large clusters, the size of gossip messages sent when a node (re)starts may exceed the hard limit of the urgent message channel. This causes an error on the sender, and ultimately the message is dropped. This in turn can cause startup failures and/or partial loss of availability.

Fortunately, the fix is quite simple. I've submitted a patch that other contributors and I have been running since discovering this issue, and we can confirm it resolves the problem. It would be great to get it reviewed and merged ASAP and then cut a 4.0.1 release.

In the meantime, it may be wise to suggest that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam