I don't see fencing times in here -- fencing is absolutely essential. With the setup you describe, I would drop qdevice. With fencing, quorum is not strictly required in a two-node cluster (two_node should be set in corosync.conf). You can set priority-fencing-delay to reduce the chance of simultaneous fencing. For VMs, you can use fence_xvm, which is extremely quick.
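
To make that concrete, here is a rough sketch only -- the resource name, VM names, key file path, and delay value below are placeholders, not taken from your setup. In corosync.conf you would drop the whole quorum.device section and enable two_node:

    quorum {
        provider: corosync_votequorum
        # two-node mode: each node keeps quorum when its peer dies and
        # fencing resolves the resulting race; implies wait_for_all
        two_node: 1
    }

Then add one fence device per node plus the delay, for example with pcs (crmsh works just as well). fence_xvm assumes fence_virtd is configured on each hypervisor with a shared key; on Proxmox specifically an API-based fence agent may be more practical, but the shape is the same:

    # one fence device per node; port is the VM name on the hypervisor
    pcs stonith create fence-node1 fence_xvm port="node1-vm" \
        pcmk_host_list="node1" key_file="/etc/cluster/fence_xvm.key"
    pcs stonith create fence-node2 fence_xvm port="node2-vm" \
        pcmk_host_list="node2" key_file="/etc/cluster/fence_xvm.key"
    # optional: prefer running each fence device away from its target
    pcs constraint location fence-node1 avoids node1
    pcs constraint location fence-node2 avoids node2
    # in a fence race, fencing of the node with higher-priority resources
    # is delayed, so the node actually running the app tends to survive
    pcs property set priority-fencing-delay=15s
    pcs resource meta my-app priority=10

priority-fencing-delay only has an effect when resources have a priority set (hence the last line), and the delay is applied when fencing the node that was running the higher-priority resources, so it adds to failover time in exactly the case you are optimizing -- size it against your 5-7 second budget.
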
On Thu, 2024-05-02 at 02:56 +0300, [email protected] wrote:
> Hi All,
>
> I am trying to build an application-specific 2-node failover cluster using Ubuntu 22, pacemaker 2.1.2 + corosync 3.1.6 and DRBD 9.2.9, knet transport.
>
> For some reason I can't use a 3-node cluster, so I have to use qnetd+qdevice 3.0.1.
>
> The main goal is to protect a custom app which is not cluster-aware by itself. It is quite stateful, can't store its state outside memory, and takes some time to converge with the other parts of the system, so the best scenario is "failover is a restart with the same config", but each unnecessary restart is painful. So, once failover has happened, the app must stay on the backup node until it fails or an admin pushes it back; this works well with the stickiness parameter.
>
> So, the goal is to detect a serving-node failure ASAP and restart the app ASAP on the other node, using DRBD-synced config/data. ASAP means within 5-7 sec, not 30 or more.
>
> I have tried different combinations of timings and finally got an acceptable result, within 5 sec in the best case. But that case is very unstable.
>
> My setup is simple: two nodes as VMs, plus one more VM as an arbiter (qnetd). The VMs run under Proxmox and are connected through an external Ethernet switch, to get closer to the real deployment where the node VMs will sit on different physical hosts in one rack.
>
> Then it was tuned for faster detection and failover.
> In corosync, I left the token at the default 1000ms but added "heartbeat_failures_allowed: 3"; this makes corosync catch a node failure in about 200ms (4 x 50ms heartbeats).
> Both qnetd and qdevice were run with net_heartbeat_interval_min=200 to allow playing with faster heartbeats and detection.
> Also, quorum.device.net has timeout: 500, sync_timeout: 3000, algo: LMS.
>
> The test is to issue "date +%M:%S.%N && qm stop 201" and then check the logs for the timestamp when the app started on the "backup" host. And when the stopped host boots up again, the test is to check the logs to confirm the app was not restarted.
>
> Sometimes the switchover works like a charm, but sometimes it is delayed for dozens of seconds.
> Sometimes when the primary host boots up again, the secondary holds quorum and keeps the app running; sometimes quorum is lost first (and pacemaker stops the app) and then regained (and pacemaker brings the app up again), so an unwanted restart happens.
>
> My investigation shows the difference between the "good" and "bad" cases:
>
> Good case: all the logs are clear and reasonable.
>
> Bad case: qnetd loses the connection to the second node just after it detects the connection to the failed node is gone, and it may then take dozens of seconds to restore it. All this time qdevice keeps trying to connect to qnetd and fails:
>
> Example: host 192.168.100.1 is the one forced to fail, 100.2 is the one failed over to:
>
> From qnetd:
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.1:60686 doesn't sent any message during 600ms. Disconnecting
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.1:60686 (init_received 1, cluster bsc-test-cluster, node_id 1) disconnect
> May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: Client 0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect
> May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: server going down 0
> >>> This disconnect is unexpected; in the normal scenario the connection persists
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:32790 doesn't sent any message during 600ms. Disconnecting
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:32790 (init_received 1, cluster bsc-test-cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: Client 0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: server going down 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster name = bsc-test-cluster
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls started = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls peer certificate verified = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   pointer = 0x55a6fc6363d0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   addr_str = ::ffff:192.168.100.2:57736
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster dump:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     client = ::ffff:192.168.100.2:57736, node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent initial node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 99
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1, data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2, data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: cluster bsc-test-cluster config_list has 2 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is No change
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent membership node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 100
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   heuristics = Undefined
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 2, data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: membership list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc67e110) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 101
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1, data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2, data_center_id = 0, node_state = member
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc697e70) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 102
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 1
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1, data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2, data_center_id = 0, node_state = member
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc669dc0) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK
> May 01 23:30:58 arbiter corosync-qnetd[6338]: Client closed connection
> May 01 23:30:58 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57674 (init_received 0, cluster bsc-test-cluster, node_id 0) disconnect
> >> At this point resource start on backup host
>
> From qdevice:
>
> May 01 23:30:40 node2 corosync-qdevice[781]: Server closed connection
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 1, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 22, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: Algorithm result vote is NACK
> May 01 23:30:40 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.
> May 01 23:30:40 node2 corosync-qdevice[781]: Sleeping for 161 ms before reconnect
> May 01 23:30:40 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 1, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum nodelist notify callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Ring_id = (2.801)
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 1):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 2
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to pause cast vote timer and result vote is No change
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is now paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "24" and timeout "1500"
> May 01 23:30:41 node2 corosync-qdevice[781]: Received heuristics exec result command with seq_no "24" and result "Disabled"
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum heuristics exec result callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   seq_number = 24, exec_result = Disabled
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not send list, result vote is No change and heuristics is Undefined
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is no longer paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: Not scheduling heuristics timer because mode is not enabled
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum quorum notify callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 1, state = 2
> May 01 23:30:41 node2 corosync-qdevice[781]:     1 nodeid = 2, state = 1
> May 01 23:30:41 node2 corosync-qdevice[781]:     2 nodeid = 0, state = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: quorum_notify. quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not send list and result vote is No change
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)
> >>> At this point quorum reported lost
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.
>
> >>> This failure pattern repeats 31 times
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)
> May 01 23:30:42 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 27, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: Algorithm result vote is NACK
> May 01 23:30:42 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.
> >>> End of pattern repeat, continue
>
> May 01 23:30:56 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending preinit msg to qnetd
> May 01 23:30:56 node2 corosync-qdevice[781]: Received preinit reply msg
> May 01 23:30:56 node2 corosync-qdevice[781]: Received init reply msg
> May 01 23:30:56 node2 corosync-qdevice[781]: Scheduling send of heartbeat every 400ms
> May 01 23:30:56 node2 corosync-qdevice[781]: Executing after-connect heuristics.
> May 01 23:30:56 node2 corosync-qdevice[781]: worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "25" and timeout "250"
> May 01 23:30:56 node2 corosync-qdevice[781]: Received heuristics exec result command with seq_no "25" and result "Disabled"
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send config node list, send membership node list, send quorum node list, heuristics is Undefined and result vote is Wait for reply
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending set option seq = 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending config node list seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1, data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2, data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending membership node list seq = 100, ringid = (2.801), heuristics = Undefined.
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 2, data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list seq = 101, quorate = 0
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1, data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2, data_center_id = 0, node_state = member
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now stopped.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received set option reply seq(1) = 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled
> May 01 23:30:56 node2 corosync-qdevice[781]: Received initial config node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = No change
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is No change
> May 01 23:30:56 node2 corosync-qdevice[781]: Received membership node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 100
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 101
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Votequorum quorum notify callback:
> May 01 23:30:56 node2 corosync-qdevice[781]:   Quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 nodeid = 1, state = 2
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 nodeid = 2, state = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:     2 nodeid = 0, state = 0
> May 01 23:30:56 node2 corosync-qdevice[781]: algo-lms: quorum_notify. quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send list and result vote is No change
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list seq = 102, quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1, data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2, data_center_id = 0, node_state = member
> May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 102
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting ACK.
> >>> Here everything becomes OK and the resource is started on node2
>
> Also, I've done a wireshark capture and found a great mess in TCP; it looks like the connection between qdevice and qnetd really stalls for some time and packets are not delivered.
>
> My guess is that this matches corosync's syncing activity, and I suspect that corosync prevents any other traffic on the interface it uses for its rings.
>
> When I switch qnetd and qdevice to a different interface, it seems to work fine.
>
> So, the question is: does corosync really temporarily block other traffic on the interface it uses, or is it just a coincidence? If it does block, is there a way to manage it?
>
> Thank you for any suggestions on that!
>
> Sincerely,
>
> Alex
-- 
Ken Gaillot <[email protected]>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
