Hi everyone. As a follow-up, I found that the VMs were undergoing snapshot backups at the time of the disconnects, which I believe freezes I/O. We'll be addressing that. Is there anything else in the logs that can be improved?
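One thing that stands out even with the backups fixed: corosync repeatedly logs "Consider token timeout increase" below, so as insurance we are looking at raising the totem token timeout past the worst stall in the log (about 13 seconds). A minimal sketch, assuming a 15-second failure-detection delay is acceptable for us; the value is our guess, not a tested recommendation:

# /etc/corosync/corosync.conf on both nodes -- only the changed setting shown
totem {
    # total token timeout in milliseconds; must outlast any I/O or
    # scheduling stall (the worst pause logged below was ~13006 ms)
    token: 15000
}

# ask every corosync instance in the cluster to reload the file
corosync-cfgtool -R

The trade-off is that detecting a genuinely dead node would now take up to 15 seconds, so we see this as a stopgap while the snapshot schedule gets sorted out. I've also put two more sketches below the quoted message: the grep filter I've started using, and the rough shape of our fencing setup.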
Thanks,
Howard

On Wed, Jun 10, 2020 at 10:06 AM Howard <[email protected]> wrote:

> Good morning. Thanks for reading. We have a requirement to provide high
> availability for PostgreSQL 10. I have built a two-node cluster with a
> quorum device as the third vote, all running on RHEL 8.
>
> Here are the versions installed:
>
> [postgres@srv2 cluster]$ rpm -qa|grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> These are VMware VMs, so I configured the cluster to use the ESX host as
> the fencing device via fence_vmware_soap.
>
> Throughout each day things generally work very well: the cluster remains
> online and healthy. Unfortunately, when I check pcs status in the
> mornings, I see that all kinds of things went wrong overnight. It is hard
> to pinpoint the issue because so much information is written to
> pacemaker.log; I end up scrolling through pages of informational entries
> trying to find the lines that pertain to the problem. Is there a way to
> separate the logs to make them easier to scan, or a list of keywords to
> grep for?
>
> The logs clearly indicate that the server lost contact with the other
> node and also with the quorum device. Is there a way to make this
> configuration more robust, or able to recover from a connectivity blip?
>
> Here are the pacemaker and corosync logs for this morning's failures:
>
> pacemaker.log
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1 for STONITH
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-34.bz2
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1 (op=join_offer)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 3 (src=307)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 2 (src=308)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (do_log) warning: Input I_RELEASE_DC received in state S_RELEASE_DC from do_election_count_vote
> /var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]: Jun 10 00:07:19 WARNING: No secondary connected to the master
> /var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
> /var/log/pacemaker/pacemaker.log:Received 0 response(s)
>
> corosync.log
> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 12922 ms
> Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
> Jun 10 00:06:41 [10558] srv2 corosync info    [VOTEQ ] lost contact with quorum device Qdevice
> Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] link: host: 1 link: 0 is down
> Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Jun 10 00:06:41 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no active links
> Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] rx: host: 1 link: 0 is up
> Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Jun 10 00:06:42 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:952) was formed. Members left: 1
> Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 10 00:06:42 [10558] srv2 corosync warning [CPG   ] downlist left_list: 1 received
> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
> Jun 10 00:06:42 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
> Jun 10 00:06:43 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:43 [10558] srv2 corosync notice  [TOTEM ] A new membership (1:960) was formed. Members joined: 1
> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
> Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
> Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
> Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 10 00:06:45 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:45 [10558] srv2 corosync notice  [TOTEM ] A new membership (1:964) was formed. Members
> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
> Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
> Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
> Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 750 ms
> Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] link: host: 1 link: 0 is down
> Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Jun 10 00:06:52 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no active links
> Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
> Jun 10 00:06:53 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:968) was formed. Members left: 1
> Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 10 00:06:53 [10558] srv2 corosync warning [CPG   ] downlist left_list: 1 received
> Jun 10 00:07:17 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
> Jun 10 00:07:17 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
> Jun 10 00:08:56 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 750 ms
> Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 5295 ms
> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
> Jun 10 00:09:13 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:972) was formed. Members
> Jun 10 00:09:13 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
> Jun 10 00:09:13 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
> Jun 10 00:09:13 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>
> Thanks,
> Howard
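On the keyword question quoted above: this is the filter I've started with. The pattern list is just what this morning's failure surfaced, not an official keyword set, so suggestions are welcome:

# severity markers plus the fencing/membership events seen above
grep -E "warning:|error:|crit:|Quorum lost|STONITH|will be fenced" /var/log/pacemaker/pacemaker.log

# and the corosync side (logfile path as configured in our logging section)
grep -E "warning|Token has not been received|A processor failed" /var/log/cluster/corosync.log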
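And for anyone comparing setups, our fencing resource was created along these lines; this is a generic sketch rather than our exact command, and every value below is a placeholder, not a real address or credential:

pcs stonith create vmfence fence_vmware_soap \
    ip=esxhost.example.com username=fenceuser password=xxxxxx \
    ssl=1 ssl_insecure=1 \
    pcmk_host_map="srv1:srv1-vm;srv2:srv2-vm"

pcmk_host_map ties each cluster node name to its VM inventory name on the ESX side, so the agent powers off the right guest.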
