On 2021-05-14 6:06 p.m., [email protected] wrote:
> On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
>> Hi all,
>>
>> I've run into an issue a couple of times now, and I'm not really sure
>> what's causing it. I've got a RHEL 8 cluster that, after a while, will
>> show one or more resources as 'FAILED'. When I try to do a cleanup, it
>> marks the resources as stopped, despite them still running. After that,
>> all attempts to manage the resources cause no change. The pcs command
>> seems to have no effect, and in some cases refuses to return.
>>
>> The logs from the nodes (filtered for 'pcs' and 'pacem' since boot) are
>> here (resources running on node 2):
>>
>> -
>> https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt
>
> The SNMP fence agent fails to start:
>
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning:
> fence_apc_snmp[12842] stderr: [ ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning:
> fence_apc_snmp[12842] stderr: [ ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning:
> fence_apc_snmp[12842] stderr: [ 2021-05-12 23:29:25,955 ERROR: Please use
> '-h' for usage ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning:
> fence_apc_snmp[12842] stderr: [ ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: notice:
> Operation 'monitor' [12842] for device 'apc_snmp_node2_an-pdu02' returned:
> -201 (Generic Pacemaker error)
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-controld[5951]: notice:
> Result of start operation for apc_snmp_node2_an-pdu02 on an-a02n01: error

I noticed this, but I have no idea why it would have failed... The
'fence_apc_snmp' is the bog-standard fence agent...
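In case it helps narrow it down, this is roughly how I plan to poke at the
agent by hand the next time it's in this state -- just a sketch, with
<pdu-address>, <outlet> and <community> standing in for the real device
settings from my config:

  # Call the agent directly with long options, mirroring the failing op:
  fence_apc_snmp --ip=<pdu-address> --plug=<outlet> \
      --community=<community> --action=monitor

  # Then feed it key=value pairs on stdin, which is closer to how
  # pacemaker-fenced actually invokes it:
  printf 'ip=<pdu-address>\nplug=<outlet>\ncommunity=<community>\naction=monitor\n' \
      | fence_apc_snmp

If the long-option form works but the stdin form reproduces the "Please use
'-h' for usage" error, I'd read that as the parameters being passed to the
agent (rather than the agent or the PDU itself) being the problem.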
> which is fatal (because start-failure-is-fatal=true):
>
> May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting
> fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY
> May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting
> last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566
>
> That happens for both devices on both nodes, so they get stopped
> (successfully), which effectively disables them from being used, though
> I don't see them needed in these logs so it wouldn't matter.

So a monitor failure on the fence agent rendered the cluster effectively
unresponsive? How would I normally recover from this?

> It looks like you did a cleanup here:
>
> May 14 14:19:30 an-a02n01.alteeve.com pacemaker-controld[5951]: notice:
> Forcing the status of all resources to be redetected
>
> It's hard to tell what happened after that without the detail log
> (/var/log/pacemaker/pacemaker.log). The resource history should have
> been wiped from the CIB, and probes of everything should have been
> scheduled and executed. But I don't see any scheduler output, which is
> odd.

Next time I start the cluster, I will truncate the pacemaker log. Then,
if/when it fails again (it seems to be happening regularly), I'll provide
the pacemaker.log file.

> Then we get a shutdown request, but the node has already left without
> getting the OK to do so:
>
> May 14 14:22:58 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting
> shutdown[an-a02n02]: (unset) -> 1621016578
> May 14 14:42:58 an-a02n01.alteeve.com pacemaker-controld[5951]: warning:
> Stonith/shutdown of node an-a02n02 was not expected
> May 14 14:42:58 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Node
> an-a02n02 state is now lost
>
> The log ends there so I'm not sure what happens after that. I'd expect
> this node to want to fence the other one. Since the fence devices are
> failed, that can't happen, so that could be why the node is unable to
> shut itself down.
>
>> -
>> https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt
>>
>> For example, it took 20 minutes for the 'pcs cluster stop' to
>> complete. (Note that I tried restarting the pcsd daemon while
>> waiting.)
>>
>> BTW, I see the errors about fence_delay metadata; that will be fixed,
>> and I don't believe it's related.
>>
>> Any advice on what happened, how to avoid it, and how to clean up
>> without a full cluster restart, should it happen again?
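For what it's worth, here's roughly what I plan to try the next time it
wedges, before falling back on a full cluster restart -- just a sketch, and
the resource/node names below are the ones from my config, so they'd need
adjusting for anyone else:

  # Check the fail count on the failed fence device:
  pcs resource failcount show apc_snmp_node2_an-pdu02

  # Wipe its operation history and fail count so Pacemaker is allowed
  # to try starting it again:
  pcs resource cleanup apc_snmp_node2_an-pdu02

  # See whether the cluster is stuck waiting on a fence action:
  stonith_admin --history '*'

  # Last resort: only after verifying the peer node really is powered
  # off, tell the cluster the fencing happened so it can move on:
  pcs stonith confirm an-a02n02

As I understand it, the cleanup is what clears the fail-count=INFINITY left
behind by start-failure-is-fatal, so the devices become eligible to start
again.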
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/