I've followed several tutorials about setting up a simple three-node cluster, with no resources (yet), under CentOS 7.
I've discovered the cluster won't restart upon rebooting a node. The other two nodes, however, do claim the cluster is up, as shown with 'pcs status cluster'. I tracked down that on the rebooted node, corosync exited with a '0' status. Nothing outright seems to be what I would call an error message, but this was recorded:

  [MAIN  ] Corosync main process was not scheduled for 2145.7053 ms (threshold is 1320.0000 ms). Consider token timeout increase.

This seems related:

  https://access.redhat.com/solutions/1217663
  High Availability cluster node logs the message "Corosync main process was
  not scheduled for X ms (threshold is Y ms). Consider token timeout increase."

I've confirmed that corosync is running with the maximum realtime scheduling priority:

  [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
  CMD                         RTPRIO
  corosync                        99

I am doing my testing in an admittedly underprovisioned VM environment. I've used this same environment for CentOS 6 / heartbeat-based solutions, and they were nowhere near as sensitive to these timing issues. Manually running 'pcs cluster start' does indeed fire everything up without a hitch, and it remains running for days at a crack.

The 'consider token timeout increase' message has me looking at this:

  https://access.redhat.com/solutions/221263

which makes this assertion for RHEL 7 or 8:

  If no token value is specified in the corosync configuration, the default
  is 1000 ms, or 1 second, for a 2-node cluster, increasing by 650 ms for
  each additional member.

I have a three-node cluster, and the arithmetic for totem.token seems to hold:

  [root@node3 ~]# corosync-cmapctl | grep totem.token
  runtime.config.totem.token (u32) = 1650
  runtime.config.totem.token_retransmit (u32) = 392
  runtime.config.totem.token_retransmits_before_loss_const (u32) = 4

I'm confused on a number of issues:

- The 'totem.token' value of 1650 doesn't seem to be related to the threshold number in the diagnostic message the corosync service logged: "threshold is 1320.0000 ms". (I do notice that 1320 is exactly 80% of 1650, but I don't know whether that is the actual relationship.) Can someone explain how these values relate?

- If I manually set 'totem.token' to a higher value, am I responsible for tracking the number of nodes in the cluster, to keep it in alignment with what Red Hat's page says? (A sketch of the override I have in mind is in the P.S. at the end of this message.)

- Under these conditions, when corosync exits, why does it do so with a zero status? It seems to me that if it exited at all, without someone controllably stopping the service, that warrants a non-zero status.

- Is there a recommended way to alter either the pacemaker/corosync or systemd configuration of these services to harden against resource issues? I don't know whether corosync's startup can be deferred until the CPU load settles, or whether some automatic retry can be set up. (A systemd drop-in sketch is also in the P.S.)

Details of my environment; I'm happy to provide others, if anyone has any specific questions:

  [root@node1 ~]# cat /etc/centos-release
  CentOS Linux release 7.6.1810 (Core)

  [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
  corosynclib-2.4.3-4.el7.x86_64
  pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
  corosync-2.4.3-4.el7.x86_64
  pacemaker-cli-1.1.19-8.el7_6.4.x86_64
  pacemaker-1.1.19-8.el7_6.4.x86_64
  pacemaker-libs-1.1.19-8.el7_6.4.x86_64

-- 
Brian Reichert <[email protected]>
BSD admin/developer at large
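P.S. To make the token question above more concrete, here is a minimal sketch of the override I'd try in /etc/corosync/corosync.conf, assuming that editing the totem section directly is the sanctioned approach. The 5000 ms value is an arbitrary guess on my part, not a recommendation, and 'mycluster' is just a placeholder for whatever name 'pcs cluster setup' generated:

  totem {
      version: 2
      # placeholder for the cluster name that 'pcs cluster setup' wrote
      cluster_name: mycluster
      transport: udpu
      # explicit override (arbitrary 5000 ms guess), replacing the
      # computed 1650 ms default described in the KB article
      token: 5000
  }

  # nodelist, quorum, and logging sections left exactly as pcs generated them

My understanding is that the file then has to be identical on all three nodes before restarting the cluster; I believe 'pcs cluster sync' handles the copying, but I haven't verified that.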
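And for the systemd question, this is the sort of drop-in I was imagining, assuming corosync.service tolerates it at all; I don't know whether automatically restarting a cluster daemon like this is considered safe, and since corosync exited with status 0 in my case, Restart=on-failure presumably wouldn't even fire, which is partly why I'm asking. Directive names are the systemd-219 (CentOS 7) spellings:

  # /etc/systemd/system/corosync.service.d/hardening.conf  (hypothetical drop-in)
  [Service]
  # crude way to defer startup until boot-time CPU load settles
  ExecStartPre=/usr/bin/sleep 30
  # retry failed starts instead of giving up immediately
  Restart=on-failure
  RestartSec=10
  # allow several start attempts within a five-minute window
  StartLimitInterval=300
  StartLimitBurst=5

followed by 'systemctl daemon-reload', if I understand drop-ins correctly.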
