On Wed, Aug 3, 2022 at 2:59 PM Lentes, Bernd <[email protected]> wrote:
>
> Hi,
>
> I have the following situation:
> Two-node cluster, just one node running (ha-idg-1).
> The second node (ha-idg-2) is in standby. The DLM monitor on ha-idg-1 times out.
> The cluster tries to restart all services depending on DLM:
>
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Recover dlm:0 ( ha-idg-1 )
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart clvmd:0 ( ha-idg-1 ) due to required dlm:0 start
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart gfs2_share:0 ( ha-idg-1 ) due to required clvmd:0 start
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart gfs2_snap:0 ( ha-idg-1 ) due to required gfs2_share:0 start
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart fs_ocfs2:0 ( ha-idg-1 ) due to required gfs2_snap:0 start
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave dlm:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave clvmd:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave gfs2_share:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave gfs2_snap:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave fs_ocfs2:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave ClusterMon-SMTP:0 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: info: LogActions: Leave ClusterMon-SMTP:1 (Stopped)
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart vm-mausdb ( ha-idg-1 ) due to required cl_share running
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart vm-sim ( ha-idg-1 ) due to required cl_share running
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart vm-geneious ( ha-idg-1 ) due to required cl_share running
> Aug 03 01:07:11 [19367] ha-idg-1 pengine: notice: LogAction: * Restart vm-idcc-devel ( ha-idg-1 ) due to required cl_share running
> ...
>
> The restart of vm-mausdb failed; the stop timed out:
>
> VirtualDomain(vm-mausdb)[32415]: 2022/08/03_01:19:06 INFO: Issuing forced shutdown (destroy) request for domain vm-mausdb.
> Aug 03 01:19:11 [19365] ha-idg-1 lrmd: warning: child_timeout_callback: vm-mausdb_stop_0 process (PID 32415) timed out
> Aug 03 01:19:11 [19365] ha-idg-1 lrmd: warning: operation_finished: vm-mausdb_stop_0:32415 - timed out after 720000ms
> ...
> Aug 03 01:19:14 [19367] ha-idg-1 pengine: warning: pe_fence_node: Cluster node ha-idg-1 will be fenced: vm-mausdb failed there
> Aug 03 01:19:15 [19368] ha-idg-1 crmd: notice: te_fence_node: Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=60000
>
> I have two fencing resources defined: one for ha-idg-1, one for ha-idg-2.
> Both are HP iLO network adapters.
> I have two location constraints; they ensure that the resource for fencing
> node ha-idg-1 runs on ha-idg-2 and vice versa.
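
For reference, constraints like the ones you describe usually look
something like this in crmsh (the constraint IDs below are made up, and
the same intent can be expressed either as a ban from a device's own
node, as shown here, or as a preference for the other node):

    location l-fence-ha-idg-1 fence_ilo_ha-idg-1 -inf: ha-idg-1
    location l-fence-ha-idg-2 fence_ilo_ha-idg-2 -inf: ha-idg-2
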
Such constraints are unnecessary. Let's say we have two stonith devices
called "fence_dev1" and "fence_dev2" that fence nodes 1 and 2,
respectively. If node 2 needs to be fenced, and fence_dev2 is running
on node 2, node 1 will still use fence_dev2 to fence node 2. The
current location of a stonith device only tells us which node is
running the recurring monitor operation for that device. The device is
available to ALL nodes, unless it's disabled or banned from a given
node. So these constraints serve no purpose in most cases.

If you ban fence_dev2 from node 1, then node 1 won't be able to use
fence_dev2 to fence node 2. Likewise, if you ban fence_dev1 from node
1, then node 1 won't be able to use fence_dev1 to fence itself.
Usually that's unnecessary anyway, but it may be preferable to power
ourselves off if we're the last remaining node and a stop operation
fails.

> I never thought that it's necessary for a node to fence itself.
> So now, with ha-idg-2 in standby, there is no fence device to stonith ha-idg-1.

If ha-idg-2 is in standby, it can still fence ha-idg-1. However, it
sounds like you've banned fence_ilo_ha-idg-1 from ha-idg-1, so that it
can't run anywhere while ha-idg-2 is in standby, and I'm not sure off
the top of my head whether fence_ilo_ha-idg-1 is available in that
situation. It may not be.

A solution would be to stop banning the stonith devices from their
respective nodes. Surely if fence_ilo_ha-idg-1 had been running on
ha-idg-1, ha-idg-2 would have been able to use it to fence ha-idg-1.
(Again, I'm not sure whether that's still true if ha-idg-2 is in
standby AND fence_ilo_ha-idg-1 is banned from ha-idg-1.)

> Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng: notice: log_operation: Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with device 'fence_ilo_ha-idg-2' returned: 0 (OK)
>
> So the cluster starts the resource on ha-idg-1 and cuts off ha-idg-2, which isn't necessary.

Here, it sounds like the pcmk_host_list setting is either missing or
misconfigured for fence_ilo_ha-idg-2. fence_ilo_ha-idg-2 should NOT be
usable for fencing ha-idg-1. fence_ilo_ha-idg-1 should be configured
with pcmk_host_list=ha-idg-1, and fence_ilo_ha-idg-2 should be
configured with pcmk_host_list=ha-idg-2 (see the sketch at the end of
this message).

What happened is that ha-idg-1 used fence_ilo_ha-idg-2 to fence
itself. Of course, this only rebooted ha-idg-2. But based on the
stonith device configuration, Pacemaker on ha-idg-1 believed that
ha-idg-1 had been fenced. Hence the "allegedly just fenced" message:

> Finally the cluster seems to realize that something went wrong:
>
> Aug 03 01:19:58 [19368] ha-idg-1 crmd: crit: tengine_stonith_notify: We were allegedly just fenced by ha-idg-1 for ha-idg-1!

> So my question now: is it necessary to have a fencing device so that a node
> can commit suicide?

Usually not, as above: it only matters if you want the last remaining
node to be able to power itself off after a failed stop operation.
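
To make the fix concrete, here is a rough crmsh sketch of the two
stonith devices with pcmk_host_list set and no location constraints at
all. The fence_ilo4 agent name, iLO addresses, and credentials below
are placeholders rather than anything taken from your configuration,
so substitute whatever agent and parameters you actually use:

    # Each device may fence only the node named in pcmk_host_list.
    primitive fence_ilo_ha-idg-1 stonith:fence_ilo4 \
        params ipaddr=ilo-ha-idg-1.example.com login=fenceuser passwd=secret \
            pcmk_host_list=ha-idg-1 \
        op monitor interval=60s
    primitive fence_ilo_ha-idg-2 stonith:fence_ilo4 \
        params ipaddr=ilo-ha-idg-2.example.com login=fenceuser passwd=secret \
            pcmk_host_list=ha-idg-2 \
        op monitor interval=60s
    # No location constraints: any node may monitor either device, and
    # any node may use either device to fence the host it maps to.

With pcmk_host_list set this way, stonith-ng on ha-idg-1 would never
have treated fence_ilo_ha-idg-2 as a way to fence ha-idg-1, and
deleting the two old location constraints (for example with "crm
configure delete <constraint-id>") removes the bans discussed above.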

>
> Bernd

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
