On 4/6/20 12:35 PM, Andrei Borzenkov wrote:
06.04.2020 17:05, Sherrard Burton wrote:
...or at least that's what i think is happening :-)

two-node cluster, plus quorum-only node. testing the behavior when the
active node is gracefully rebooted. all seems well initially. resources
are migrated, come up and function as expected.

but, when the rebooted node starts to come back up, the other node seems
to lose quorum temporarily, even though it still has communication with
the quorum node. this causes the resources to stop until quorum is
reestablished.

summary:
active node: xen-nfs01 192.168.250.50
standby node: xen-nfs02 192.168.250.51
quorum node: xen-quorum 192.168.250.52

issue reboot on xen-nfs01
xen-nfs02 becomes active node

xen-nfs01 starts to come back online
xen-nfs02 detects loss of quorum and stops resources
xen-nfs01 finishes booting
quorum is reestablished


instead of inundating you with all of the debugging output from
corosync, pacemaker and corosync-qnetd on all nodes, i'll start with the
basics, and provide whatever else is needed on request.


Well, to sensibly interpret the logs, the IP address of each node and the
corosync configuration are needed at the very least.

node IPs provided above.
corosync conf:

root@xen-nfs01:~# grep -vF -e '#' /etc/corosync/corosync.conf | grep -vFx ''
totem {
        version: 2
        cluster_name: xen-nfs01_xen-nfs02
        crypto_cipher: aes256
        crypto_hash: sha512
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        debug: on
        logger_subsys {
                subsys: QUORUM
                debug: on
        }
}
nodelist {
        node {
                name: xen-nfs01
                nodeid: 1
                ring0_addr: 192.168.250.50
        }
        node {
                name: xen-nfs02
                nodeid: 2
                ring0_addr: 192.168.250.51
        }
}
quorum {
        provider: corosync_votequorum
        device {
                model: net
                votes: 1
                sync_timeout: 3000
                timeout: 1000
                net {
                        tls: on
                        host: xen-quorum
                        algorithm: ffsplit
                }
        }
}



TIA


from the node that was not rebooted:
Apr  5 23:10:15 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP
error from 192.168.250.51: No route to host
Apr  5 23:10:15 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP
error from 192.168.250.51: No route to host
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP
error from 192.168.250.50: Connection refused
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP
error from 192.168.250.50: Connection refused
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] rx: host: 1 link:
0 received pong: 1
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   seq = 6
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   vote = NACK
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   ring id = (2.814)
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote
is NACK
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer
remains scheduled every 500ms voting NACK.
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] flags:
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] got nodeinfo
message from cluster node 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] nodeinfo
message[2]: votes: 1, expected: 3 flags: 49
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] flags:
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ]
total_votes=2, expected_votes=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] node 1
state=2, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] node 2
state=1, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] quorum lost,
blocking activity

qdevice decided not to cast a vote for the nfs02 node.
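a rough sketch of the ffsplit decision may help here (illustrative only, not the actual qnetd code): a partition only gets ACK if it holds a strict majority of the expected nodes, and on an exact 50:50 split the tie is broken in favor of the partition containing the lowest node id. that is consistent with node 1, announcing a single-member ring right after boot, winning the vote over node 2:

```python
def ffsplit_pick(partitions, total_nodes):
    """Toy model of corosync-qnetd's ffsplit algorithm (simplified
    sketch, not the real implementation): a partition wins only if it
    holds more than half of the expected nodes; on an exact 50:50
    split, the partition containing the lowest node id wins."""
    half = total_nodes / 2
    # A strict majority wins outright.
    for p in partitions:
        if len(p) > half:
            return p
    # Exact split: break the tie with the lowest node id.
    return min(partitions, key=min)

# Two single-node partitions of a two-node cluster: node 1 wins the tie,
# so node 2 is NACKed even though it can still reach the qnetd host.
print(ffsplit_pick([{1}, {2}], total_nodes=2))  # → {1}
```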

Apr 05 23:10:17 [19099] xen-nfs02 corosync notice  [QUORUM] This node is
within the non-primary component and will NOT provide any services.
Apr 05 23:10:17 [19099] xen-nfs02 corosync notice  [QUORUM] Members[1]: 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [QUORUM] sending
quorum notification to (nil), length = 52
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] Sending
quorum callback, quorate = 0
...
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum
notify callback:
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Quorate = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Node list (size = 3):
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     0 nodeid = 1,
state = 2
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     1 nodeid = 2,
state = 1
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     2 nodeid = 0,
state = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to
send list and result vote is No change
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node
list seq = 13, quorate = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Node list:
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     0 node_id = 1,
data_center_id = 0, node_state = dead
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     1 node_id = 2,
data_center_id = 0, node_state = member



from the quorum node:
Apr 05 23:10:17 debug   New client connected
Apr 05 23:10:17 debug     cluster name = xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug     tls started = 1
Apr 05 23:10:17 debug     tls peer certificate verified = 1
Apr 05 23:10:17 debug     node_id = 1
Apr 05 23:10:17 debug     pointer = 0x55b37c2d74f0
Apr 05 23:10:17 debug     addr_str = ::ffff:192.168.250.50:54462
Apr 05 23:10:17 debug     ring id = (1.814)
Apr 05 23:10:17 debug     cluster dump:
Apr 05 23:10:17 debug       client = ::ffff:192.168.250.51:54876,
node_id = 2
Apr 05 23:10:17 debug       client = ::ffff:192.168.250.50:54462,
node_id = 1
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
Apr 05 23:10:17 debug     msg seq num = 4
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug       node_id = 2, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug   Algorithm result vote is Ask later
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
Apr 05 23:10:17 debug     msg seq num = 5
Apr 05 23:10:17 debug     ring id = (1.814)
Apr 05 23:10:17 debug     heuristics = Undefined
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug   ffsplit: Membership for cluster
xen-nfs01_xen-nfs02 is now stable
Apr 05 23:10:17 debug   ffsplit: Quorate partition selected
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug   Sending vote info to client
::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug     vote = NACK
Apr 05 23:10:17 debug   Algorithm result vote is No change
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug     quorate = 0
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state
= member

Oops. How come the node that was rebooted formed a cluster all by
itself, without seeing the second node? Do you have two_node and/or
wait_for_all configured?

neither. i removed two_node when i added the quorum node. i was not previously familiar with wait_for_all.
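for reference, if wait_for_all were wanted it would go in the quorum section alongside the existing device block -- a sketch (untested here) based on the config above:

```
quorum {
        provider: corosync_votequorum
        wait_for_all: 1
        device {
                model: net
                votes: 1
                sync_timeout: 3000
                timeout: 1000
                net {
                        tls: on
                        host: xen-quorum
                        algorithm: ffsplit
                }
        }
}
```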


Apr 05 23:10:17 debug   Algorithm result vote is No change
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.51:54876 (cluster
xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug   ffsplit: All NACK votes sent for cluster
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug   Sending vote info to client
::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
Apr 05 23:10:17 debug     msg seq num = 1
Apr 05 23:10:17 debug     vote = ACK
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
Apr 05 23:10:17 debug     msg seq num = 1
Apr 05 23:10:17 debug   ffsplit: All ACK votes sent for cluster
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.51:54876 (cluster
xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
Apr 05 23:10:17 debug     msg seq num = 13
Apr 05 23:10:17 debug     quorate = 0
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state
= dead
Apr 05 23:10:17 debug       node_id = 2, data_center_id = 0, node_state
= member
Apr 05 23:10:17 debug   Algorithm result vote is No change

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
