Re: Gossip issues after upgrading to 4.0.4

2022-06-06 Thread Gil Ganz
The only errors I see in the logs prior to the gossip pending issue are things like this: INFO [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833 NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed to connect io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect
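
To see whether these connection failures cluster around particular peers, one could tally them per source/destination pair. The following is only a rough sketch: the regular expression is inferred from the single log line quoted above, and the names `FAILED_CONNECT` and `count_failed_connects` are made up for illustration. It assumes IPv4 addresses or hostnames (the addresses are redacted as X and Y here), and other log layouts may need a different pattern.

```python
import re
from collections import Counter

# Assumed layout, taken only from the single log line quoted above;
# IPv6 peers or other logging configurations may format this differently.
FAILED_CONNECT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \S+ - "
    r"/(?P<src>[^:/]+):\d+->/(?P<dst>[^:/]+):\d+-"
    r"(?P<kind>[A-Z_]+)-\[[^\]]*\] failed to connect"
)


def count_failed_connects(lines):
    """Count 'failed to connect' lines per (source, destination, message type)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CONNECT.search(line)
        if match:
            counts[(match["src"], match["dst"], match["kind"])] += 1
    return counts
```

Passing an open log file (for example `count_failed_connects(open(path))`, where `path` points at a node's log) and calling `.most_common(10)` on the result would list the peer pairs with the most failures.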

Re: Gossip issues after upgrading to 4.0.4

2022-06-06 Thread C. Scott Andreas
Hi Gil, thanks for reaching out. Can you check Cassandra's logs to see if any uncaught exceptions are being thrown? What you described suggests the possibility of an uncaught exception being thrown in the Gossiper thread, preventing further tasks from making progress; however, I'm not aware of any

Gossip issues after upgrading to 4.0.4

2022-06-06 Thread Gil Ganz
Hey, we have a big cluster (>500 nodes, on-prem, multiple datacenters, most with vnodes=32, but some with 128) that was recently upgraded from 3.11.9 to 4.0.4. Servers are all CentOS 7. We have been dealing with a few issues related to gossip since: 1 - The moment the last node in the cluster was

Re: Cluster & Nodetool

2022-06-06 Thread Bowen Song
Was more than one node added to the cluster at the same time? I.e. did you start a new node which will join the cluster without waiting for a previous node to finish joining the same cluster? This can happen if you don't have "serial: 1" in your Ansible script, or don't have a proper wait. Removi

RE: Cluster & Nodetool

2022-06-06 Thread Marc Hoppins
No. It transpires that, after seeing errors when running a start.yml for Ansible, I decided to start all nodes again, and when starting, some assumed the same ID as others. I resolved this by shutting down the service on the affected nodes and removing the data dirs (these are all new nodes: no dat