The only errors I see in the logs prior to the gossip pending issue are things
like this:
INFO [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833
NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed
to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException:
finishConnect
Hi Gil, thanks for reaching out. Can you check Cassandra's logs to see if any uncaught exceptions are
being thrown? What you described suggests the possibility of an uncaught exception being thrown in
the Gossiper thread, preventing further tasks from making progress; however I'm not aware of any
Hey
We have a big cluster (>500 nodes, on-prem, multiple datacenters, most with
vnodes=32, but some with 128) that was recently upgraded from 3.11.9 to
4.0.4. Servers are all CentOS 7.
We have been dealing with a few issues related to gossip since:
1 - The moment the last node in the cluster was
Was more than one node added to the cluster at the same time? I.e. did
you start a new node that would join the cluster without waiting for a
previous node to finish joining the same cluster? This can happen if you
don't have "serial: 1" in your Ansible script, or don't have a proper wait.
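As a rough illustration of what a "proper wait" could check, here is a minimal
Python sketch. It assumes the usual `nodetool status` layout, where each node
row starts with a two-letter code (UN = Up/Normal, UJ = Up/Joining, and so on).
The function name and the sample output are made up for the example; this is
not taken from anyone's actual playbook.

```python
import re

# Each node row in `nodetool status` starts with a two-letter code:
# first letter U/D (Up/Down), second N/L/J/M (Normal/Leaving/Joining/Moving).
_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def nodes_not_normal(status_text):
    """Return (code, address) pairs for every node that is not Up/Normal."""
    pending = []
    for line in status_text.splitlines():
        match = _ROW.match(line.strip())
        if match and match.group(1) != "UN":
            pending.append((match.group(1), match.group(2)))
    return pending


# Hypothetical sample output, trimmed to the columns that matter here.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB   32      ?     aaaa     rack1
UJ  10.0.0.2   80 MiB    32      ?     bbbb     rack1
UN  10.0.0.3   1.1 GiB   32      ?     cccc     rack1
"""

if __name__ == "__main__":
    # A non-empty list means some node is still joining, leaving, moving or down.
    print(nodes_not_normal(SAMPLE))
```

A deployment script could feed this the real `nodetool status` output and only
start the next node once the list comes back empty.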
Removi
No. It transpires that, after seeing errors when running a start.yml for
Ansible, I decided to start all nodes again, and on starting, some assumed the
same ID as others.
I resolved this by shutting down the service on the affected nodes and removing
the data dirs (these are all new nodes: no dat