Package: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Dear Maintainer,
* What led up to the situation?

  ** 2-node cluster running Corosync 3.0.1 on Debian Buster.
  ** 2 knet links: ring0 on eth0 (front-facing interface), ring1 on eth1 (back-to-back link).
  ** Services running on cluster-node01.
  ** The cluster is running just fine; both nodes are online and see each other.
  ** crm_mon shows 2 online nodes and the running resources without errors.

* What exactly did you do (or not do) that was effective (or ineffective)?

  For failover testing we disconnected the eth0 interface on the active node (cluster-node01).

* What was the outcome of this action?

  ** Situation on the active node (cluster-node01)

  Corosync on this node becomes unresponsive: it no longer responds to commands such as corosync-cfgtool and corosync-quorumtool. In crm_mon, however, the cluster status still looks fine; it claims both nodes are online and the services are healthy. The corosync log, on the other hand, indicates that the cluster is disconnected.

####### corosync.log ####
Sep 11 10:06:45 [1946] cluster-node01 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
#########################

  ** Situation on the passive node (cluster-node02)

  Corosync does respond to commands such as corosync-cfgtool and shows that cluster-node01 is offline on all links.

####### corosync.log #######
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 0 is down
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 1 active links
Sep 11 10:06:10 [1941] cluster-node02 corosync notice [TOTEM ] Token has not been received in 2250 ms
Sep 11 10:06:11 [1941] cluster-node02 corosync notice [TOTEM ] A processor failed, forming new configuration.
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] A new membership (2:16) was formed. Members left: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync warning [CPG ] downlist left_list: 1 received
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [QUORUM] Members[1]: 2
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 1 is down
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 0 active links
Sep 11 10:06:16 [1941] cluster-node02 corosync warning [KNET ] host: host: 1 has no active links
#########################

## corosync-cfgtool -s ##
root@cluster-node02:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:0
                node 1: link enabled:1 link connected:1
LINK ID 1
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:1
                node 1: link enabled:0 link connected:1
#########################

##### crm_mon -rfA1 #######
root@cluster-node02:~# crm_mon -rfA1
Stack: corosync
Current DC: cluster-node02 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri Sep 11 10:45:53 2020
Last change: Fri Sep 11 10:42:26 2020 by root via cibadmin on cluster-node02

2 nodes configured
7 resources configured

Online: [ cluster-node02 ]
OFFLINE: [ cluster-node01 ]
#########################

  Pacemaker therefore tries to perform a failover.

* What outcome did you expect instead?

  With our configuration the cluster should not take any action; both nodes should still see each other on link1.

* Tests with Corosync 3.0.3 from Debian testing

  We installed the packages from Debian testing and fulfilled the dependencies from Debian backports.
#########################
apt install libnozzle1=1.16-2~bpo10+1 libknet1=1.16-2~bpo10+1 \
    libnl-3-200 libnl-route-3-200 libknet-dev=1.16-2~bpo10+1 \
    ./corosync_3.0.3-2_amd64.deb ./libcorosync-common4_3.0.3-2_amd64.deb
#########################

The described problem does not occur with the 3.0.3 version from Debian testing.

-- System Information:
Debian Release: 10.5
  APT prefers stable
  APT policy: (550, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-10-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages corosync depends on:
ii  adduser              3.118
ii  init-system-helpers  1.56+nmu1
ii  libc6                2.28-10
ii  libcfg7              3.0.1-2+deb10u1
ii  libcmap4             3.0.1-2+deb10u1
ii  libcorosync-common4  3.0.1-2+deb10u1
ii  libcpg4              3.0.1-2+deb10u1
ii  libknet1             1.8-2
ii  libqb0               1.0.5-1
ii  libquorum5           3.0.1-2+deb10u1
ii  libstatgrab10        0.91-1+b2
ii  libsystemd0          241-7~deb10u4
ii  libvotequorum8       3.0.1-2+deb10u1
ii  lsb-base             10.2019051400
ii  xsltproc             1.1.32-2.2~deb10u1

corosync recommends no packages.

corosync suggests no packages.

-- Configuration Files:
/etc/corosync/corosync.conf changed:
totem {
        version: 2
        cluster_name: debian
        token: 3000
        token_retransmits_before_loss_const: 10
        crypto_model: nss
        crypto_cipher: aes256
        crypto_hash: sha256
        link_mode: active
        keyfile: /etc/corosync/authkey
}
nodelist {
        node {
                nodeid: 1
                name: cluster-node01
                ring0_addr: ###.###.###.142
                ring1_addr: 192.168.14.1
        }
        node {
                nodeid: 2
                name: cluster-node02
                ring0_addr: ###.###.###.143
                ring1_addr: 192.168.14.2
        }
}
logging {
        fileline: off
        to_stderr: no
        to_syslog: no
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        debug: off
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 1
        auto_tie_breaker: 0
}

-- no debconf information
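For anyone reproducing this, the asymmetric per-link state can be spotted quickly by filtering the `corosync-cfgtool -s` output on the surviving node (the affected node does not answer cfgtool at all). A minimal sketch; summarize_links is a hypothetical helper, not part of corosync, and it only assumes the output format shown above:

```shell
#!/bin/sh
# Hypothetical helper: read `corosync-cfgtool -s` output on stdin and print
# one line for every link/node pair that is enabled but not connected.
summarize_links() {
    awk '
        /^LINK ID/ { link = $3 }        # remember the current link ID
        /link enabled:1/ && /link connected:0/ {
            # matching lines look like: "node 0: link enabled:1 link connected:0"
            printf "link %s %s %s enabled but not connected\n", link, $1, $2
        }
    '
}
```

Usage would be `corosync-cfgtool -s | summarize_links`; for the cfgtool output quoted above it flags only link 0 towards node 0, matching what we saw on cluster-node02.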