Package: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Dear Maintainer,
* What led up to the situation?

  ** 2-node cluster running Corosync 3.0.1 on Debian Buster.
  ** 2 knet links: ring0 on eth0 (front-facing interface), ring1 on eth1 (back-to-back link).
  ** Services running on cluster-node01.
  ** The cluster is running just fine; both nodes are online and see each other.
  ** crm_mon shows 2 online nodes and the running resources without errors.

* What exactly did you do (or not do) that was effective (or ineffective)?

  For failover testing we disconnected the eth0 interface on the active node (cluster-node01).

* What was the outcome of this action?

  ** Situation on the active node (cluster-node01)

  Corosync on this node becomes unresponsive: it no longer responds to commands such as corosync-cfgtool and corosync-quorumtool. In crm_mon, however, the cluster status still looks fine; it claims both nodes are online and the services are healthy. The corosync log, on the other hand, indicates that the cluster is disconnected.

####### corosync.log ####
Sep 11 10:06:45 [1946] cluster-node01 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
#########################

  ** Situation on the passive node (cluster-node02)

  Corosync does respond to commands such as corosync-cfgtool and shows that cluster-node01 is offline on all links.

####### corosync.log #######
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 0 is down
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 1 active links
Sep 11 10:06:10 [1941] cluster-node02 corosync notice [TOTEM ] Token has not been received in 2250 ms
Sep 11 10:06:11 [1941] cluster-node02 corosync notice [TOTEM ] A processor failed, forming new configuration.
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] A new membership (2:16) was formed. Members left: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync warning [CPG ] downlist left_list: 1 received
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [QUORUM] Members[1]: 2
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 1 is down
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 0 active links
Sep 11 10:06:16 [1941] cluster-node02 corosync warning [KNET ] host: host: 1 has no active links
#########################

## corosync-cfgtool -s ##
root@cluster-node02:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:0
                node 1: link enabled:1 link connected:1
LINK ID 1
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:1
                node 1: link enabled:0 link connected:1
#########################

##### crm_mon -rfA1 #######
root@cluster-node02:~# crm_mon -rfA1
Stack: corosync
Current DC: cluster-node02 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri Sep 11 10:45:53 2020
Last change: Fri Sep 11 10:42:26 2020 by root via cibadmin on cluster-node02

2 nodes configured
7 resources configured

Online: [ cluster-node02 ]
OFFLINE: [ cluster-node01 ]
#########################

  Pacemaker therefore tries to perform a failover.

* What outcome did you expect instead?

  With our configuration the cluster should not take any action; both nodes should still see each other on link1.

* Tests with Corosync 3.0.3 from Debian testing

  We installed the packages from Debian testing and fulfilled the dependencies from Debian backports.
#########################
apt install libnozzle1=1.16-2~bpo10+1 libknet1=1.16-2~bpo10+1 \
    libnl-3-200 libnl-route-3-200 libknet-dev=1.16-2~bpo10+1 \
    ./corosync_3.0.3-2_amd64.deb ./libcorosync-common4_3.0.3-2_amd64.deb
#########################

The described problem does not occur with the 3.0.3 version from Debian testing.

-- System Information:
Debian Release: 10.5
  APT prefers stable
  APT policy: (550, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-10-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages corosync depends on:
ii  adduser              3.118
ii  init-system-helpers  1.56+nmu1
ii  libc6                2.28-10
ii  libcfg7              3.0.1-2+deb10u1
ii  libcmap4             3.0.1-2+deb10u1
ii  libcorosync-common4  3.0.1-2+deb10u1
ii  libcpg4              3.0.1-2+deb10u1
ii  libknet1             1.8-2
ii  libqb0               1.0.5-1
ii  libquorum5           3.0.1-2+deb10u1
ii  libstatgrab10        0.91-1+b2
ii  libsystemd0          241-7~deb10u4
ii  libvotequorum8       3.0.1-2+deb10u1
ii  lsb-base             10.2019051400
ii  xsltproc             1.1.32-2.2~deb10u1

corosync recommends no packages.

corosync suggests no packages.

-- Configuration Files:
/etc/corosync/corosync.conf changed:
totem {
        version: 2
        cluster_name: debian
        token: 3000
        token_retransmits_before_loss_const: 10
        crypto_model: nss
        crypto_cipher: aes256
        crypto_hash: sha256
        link_mode: active
        keyfile: /etc/corosync/authkey
}
nodelist {
        node {
                nodeid: 1
                name: cluster-node01
                ring0_addr: ###.###.###.142
                ring1_addr: 192.168.14.1
        }
        node {
                nodeid: 2
                name: cluster-node02
                ring0_addr: ###.###.###.143
                ring1_addr: 192.168.14.2
        }
}
logging {
        fileline: off
        to_stderr: no
        to_syslog: no
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        debug: off
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 1
        auto_tie_breaker: 0
}

-- no debconf information
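For anyone reproducing this, the asymmetric per-link state can be spotted quickly by filtering the `corosync-cfgtool -s` output on the surviving node (the affected node does not answer cfgtool at all). A minimal sketch; summarize_links is a hypothetical helper, not part of corosync, and it only assumes the output format shown above:

```shell
#!/bin/sh
# Hypothetical helper: read `corosync-cfgtool -s` output on stdin and print
# one line for every link/node pair that is enabled but not connected.
summarize_links() {
    awk '
        /^LINK ID/ { link = $3 }        # remember the current link ID
        /link enabled:1/ && /link connected:0/ {
            # matching lines look like: "node 0: link enabled:1 link connected:0"
            printf "link %s %s %s enabled but not connected\n", link, $1, $2
        }
    '
}
```

Usage would be `corosync-cfgtool -s | summarize_links`; for the cfgtool output quoted above it flags only link 0 towards node 0, matching what we saw on cluster-node02.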