22.01.2019 20:00, Ken Gaillot wrote:
> On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
>> Hi,
>>
>> we have a new UPS which has enough charge to provide our 2-node
>> cluster with the periphery (SAN, switches ...) for a reasonable time.
>> I'm currently thinking about the shutdown and restart procedure for
>> the complete cluster when the power is lost and does not come back
>> soon. The cluster is then fed by the UPS, but that does not run
>> indefinitely, so I have to shut down the complete cluster.
>> I have the possibility to run scripts on each node which are
>> triggered by the UPS.
>>
>> My shutdown procedure is:
>> crm -w node standby node1
>>   resources are migrated to node2
>> systemctl stop pacemaker
>>   stops also corosync
>>   node is not fenced! (because of standby?)
>
> Clean shutdowns don't get fenced. As long as the exiting node can tell
> the rest of the cluster that it's leaving, everything can be
> coordinated gracefully.
>
>> systemctl poweroff
>>   clean shutdown of node1
>>
>> crm -w node standby node2
>>   clean stop of resources
>> systemctl stop pacemaker
>> systemctl poweroff
>>
>> The scripts would be executed from node2, via ssh for node1.
>> What do you think about it?
>
> Good plan, though perhaps there should be some allowance for the case
> in which only node1 is running when the power dies.
>
>> Now the restart, which gives me trouble.
>> Currently I want to restart the cluster manually, because I'm not
>> completely familiar with pacemaker yet and a bit afraid of running
>> into constellations, due to automation, that I didn't think of before.
>>
>> I can do that from anywhere because both nodes have ILO cards.
>>
>> I start e.g. node1 with the power button, then:
>> systemctl start corosync
>> systemctl start pacemaker
>>   corosync and pacemaker don't start automatically; I read that
>>   several times as a recommendation.
>> Now my first problem. Let's assume the other node is broken, but I
>> still want to get the resources running. My no-quorum-policy is
>> ignore; that should be fine. But I have this setup now and don't get
>> the resources running automatically.
>
> I'm guessing you have corosync 2's wait_for_all set (probably
> implicitly by two_node). This is a safeguard for the situation where
> both nodes are booted up but can't see each other.
>
> If you're sure the other node is down, you can disable wait_for_all
> before starting the node. (I'm not sure if this can be changed while
> corosync is already running.)
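
(For reference: with corosync 2's votequorum, two_node: 1 turns
wait_for_all on implicitly unless it is overridden. A minimal sketch of
the relevant part of /etc/corosync/corosync.conf - check it against
your actual file before changing anything:

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node: 1 implies wait_for_all: 1. Setting it to 0 lets a
        # single booting node become quorate on its own - only do this
        # while you are certain the other node is really down, and
        # revert it once the second node is back.
        wait_for_all: 0
    }

See votequorum(5) for the exact semantics.)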
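
As an aside on the UPS-triggered shutdown scripts above: a minimal
sketch of what node2 could run, assuming the node names node1/node2
and passwordless ssh from node2 to node1 (both are assumptions about
your setup):

    #!/bin/sh
    # UPS-triggered shutdown, executed on node2.
    # Per Ken's remark, allow for the case where node1 is not
    # running at all; the reachability check is a crude stand-in.
    if ssh -o ConnectTimeout=5 node1 true 2>/dev/null; then
        crm -w node standby node1             # resources migrate to node2
        ssh node1 'systemctl stop pacemaker'  # also stops corosync
        ssh node1 'systemctl poweroff'        # clean shutdown of node1
    fi
    crm -w node standby node2                 # clean stop of resources
    systemctl stop pacemaker
    systemctl poweroff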
>
>>
>> crm_mon says:
>> =====================================================================
>> Stack: corosync
>> Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
>> Last updated: Tue Jan 22 15:34:19 2019
>> Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
>>
>> 2 nodes configured
>> 13 resources configured
>>
>> Node ha-idg-1: online
>> Node ha-idg-2: UNCLEAN (offline)
>>
>> Inactive resources:
>>
>>  fence_ha-idg-2    (stonith:fence_ilo2):    Stopped
>>  fence_ha-idg-1    (stonith:fence_ilo4):    Stopped
>>  Clone Set: cl_share [gr_share]
>>      Stopped: [ ha-idg-1 ha-idg-2 ]
>>  vm_mausdb      (ocf::heartbeat:VirtualDomain):    Stopped
>>  vm_sim         (ocf::heartbeat:VirtualDomain):    Stopped
>>  vm_geneious    (ocf::heartbeat:VirtualDomain):    Stopped
>>  Clone Set: cl_SNMP [SNMP]
>>      Stopped: [ ha-idg-1 ha-idg-2 ]
>>
>> Node Attributes:
>> * Node ha-idg-1:
>>     + maintenance    : off
>>
>> Migration Summary:
>> * Node ha-idg-1:
>>
>> Failed Fencing Actions:
>> * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
>>     last-failed='Tue Jan 22 15:34:17 2019'
>>
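
(Before changing anything, you can check at runtime whether
wait_for_all is actually in effect:

    corosync-quorumtool -s

The "Flags:" line should list WaitForAll, and 2Node, if they are
enabled.)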
This is another problem - if the cluster requires stonith, it won't
start resources while the other node is UNCLEAN and the fencing
attempt has apparently failed.

>> Negative Location Constraints:
>>  loc_fence_ha-idg-1    prevents fence_ha-idg-1 from running on ha-idg-1
>>  loc_fence_ha-idg-2    prevents fence_ha-idg-2 from running on ha-idg-2
>> =====================================================================
>> The cluster does not have quorum, but that shouldn't be a problem.
>> corosync and pacemaker are started.
>> Why don't the resources start automatically? All target-roles are
>> set to "started".
>> Is it because the fencing didn't succeed? The status of ha-idg-2
>> isn't clear to the cluster?
>> If yes, what can I do?
>>
>> Bernd
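
If you are certain ha-idg-2 is really powered off, you can acknowledge
the failed fencing manually so the cluster stops treating the node as
UNCLEAN and starts resources. A sketch - be absolutely sure the node
is down first; confirming fencing for a node that is still running is
a recipe for data corruption:

    # Tell the cluster the node is safely down, as if fencing succeeded:
    stonith_admin --confirm ha-idg-2

    # or, with crmsh:
    crm node clearstate ha-idg-2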
