On Fri, Feb 6, 2026 at 3:41 PM Klaus Wenninger <[email protected]> wrote:
> On Thu, Feb 5, 2026 at 8:07 PM Anton Gavriliuk <[email protected]> wrote:
>
>> - The other way round: pcs stonith create watchdog fence_watchdog
>>
>> Yes, that works, thank you! After creation it automatically started on the 2nd node – memverge2.
>>
>> Cluster Summary:
>>   * Stack: corosync (Pacemaker is running)
>>   * Current DC: memverge2 (28) (version 3.0.1-3.el10-b1a23a6) - partition with quorum
>>   * Last updated: Thu Feb  5 21:02:49 2026 on memverge
>>   * Last change:  Thu Feb  5 21:01:00 2026 by root via root on memverge
>>   * 2 nodes configured
>>   * 23 resource instances configured
>>
>> Node List:
>>   * Node memverge (27): online, feature set 3.20.1
>>   * Node memverge2 (28): online, feature set 3.20.1
>>
>> Full List of Resources:
>>   * Resource Group: g-nfs:
>>     * pb_nfs (ocf:heartbeat:portblock): Started memverge
>>     * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
>>     * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * nfsserver (ocf:heartbeat:nfsserver): Started memverge
>>     * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started memverge
>>     * samba_service (systemd:smb): Started memverge
>>     * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * punb_nfs (ocf:heartbeat:portblock): Started memverge
>>   * Resource Group: g-iscsi:
>>     * pb_iscsi (ocf:heartbeat:portblock): Started memverge
>>     * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>>     * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>>     * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
>>     * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
>>     * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
>>     * punb_iscsi (ocf:heartbeat:portblock): Started memverge
>>   * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
>>     * ha-nfs (ocf:linbit:drbd): Promoted memverge
>>     * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
>>   * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
>>     * ha-iscsi (ocf:linbit:drbd): Promoted memverge
>>     * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
>>   * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
>>   * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started memverge
>>   * watchdog (stonith:fence_watchdog): Started memverge2
>>
>> But I assume I should create the same for the 1st node – memverge?
>
> Probably you will not need a 2nd instance. That is as with any other fencing resource, where usually monitoring would be running - but that isn't doing anything with watchdog anyway, iirc.

Execution of a fencing action can usually happen on any node where you don't explicitly forbid it, which is the reason why you should ban it from nodes where you know it would fail for whatever reason. Watchdog is of course a bit peculiar here, as the only action that is executed as with other fence agents is meta-data - everything else is handled within Pacemaker. That was my primary intent when I implemented the possibility to make the watchdog device visible: so that you could have it in a fencing topology and could disable watchdog-fencing for certain nodes, using the usual mechanisms and high-level tooling.
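A rough sketch of what that could look like with pcs - untested here, with device and node names taken from your status output above, so adjust as needed:

    # level 1: IPMI fencing, level 2: watchdog fencing as a fallback, per node
    pcs stonith level add 1 memverge ipmi-fence-memverge
    pcs stonith level add 2 memverge watchdog
    pcs stonith level add 1 memverge2 ipmi-fence-memverge2
    pcs stonith level add 2 memverge2 watchdog

    # and, if watchdog-fencing should not be used for a certain node, ban the
    # now-visible device from it with an ordinary location constraint (example only)
    pcs constraint location watchdog avoids memverge2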
Klaus

> Klaus
>
>> Anton
>>
>> *From:* Klaus Wenninger <[email protected]>
>> *Sent:* Thursday, February 5, 2026 4:16 PM
>> *To:* Anton Gavriliuk <[email protected]>
>> *Cc:* Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>>
>> On Thu, Feb 5, 2026 at 3:07 PM Anton Gavriliuk <[email protected]> wrote:
>>
>> - But sry again I forgot to mention that the fence-resource has to be called 'watchdog', otherwise pacemaker won't align it with the already existing (if you have stonith-watchdog-timeout != 0) internal hidden device.
>>
>> [root@memverge ~]# pcs stonith create watchdog-fencing watchdog
>> Error: Agent 'stonith:watchdog' is not installed or does not provide valid metadata: crm_resource: Metadata query for stonith:watchdog failed: No such device or address, Error performing operation: No such object, use --force to override
>> Error: Errors have occurred, therefore pcs is unable to continue
>>
>> The other way round: pcs stonith create watchdog fence_watchdog
>>
>> [root@memverge ~]#
>>
>> - Can you provide your cib & corosync-config so that we don't have to write back and forth that often?
>>
>> I attached them as files.
>>
>> Anton
>>
>> *From:* Klaus Wenninger <[email protected]>
>> *Sent:* Thursday, February 5, 2026 3:42 PM
>> *To:* Anton Gavriliuk <[email protected]>
>> *Cc:* Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>>
>> On Thu, Feb 5, 2026 at 2:21 PM Anton Gavriliuk <[email protected]> wrote:
>>
>> I tried,
>>
>> [root@memverge ~]# pcs stonith create watchdog-fencing fence_watchdog
>>
>> But after that, the running cluster is hanging... I can't run "crm_mon -Rr": "error: Lost connection to controller".
>>
>> Perhaps this is because /dev/watchdog is already managed by pacemaker?
>>
>> [root@memverge ~]# systemctl status sbd
>> ● sbd.service - Shared-storage based fencing daemon
>>      Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset: disabled)
>>     Drop-In: /etc/systemd/system/sbd.service.d
>>              └─override.conf
>>      Active: active (running) since Tue 2026-02-03 16:09:00 EET; 1 day 22h ago
>>  Invocation: 11a9ba526ef5403682980d67a886a7b9
>>        Docs: man:sbd(8)
>>    Main PID: 2473 (sbd)
>>       Tasks: 3 (limit: 3355442)
>>      Memory: 18.8M (peak: 19.5M)
>>         CPU: 2min 22.568s
>>      CGroup: /system.slice/sbd.service
>>              ├─2473 "sbd: inquisitor"
>>              ├─2487 "sbd: watcher: Pacemaker"
>>              └─2488 "sbd: watcher: Cluster"
>>
>> Feb 03 16:09:00 memverge sbd[2473]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
>> Feb 03 16:09:00 memverge sbd[2473]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
>> Feb 03 16:09:00 memverge systemd[1]: Started sbd.service - Shared-storage based fencing daemon.
>> Feb 03 16:09:04 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>> Feb 03 16:11:27 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
>> Feb 03 16:11:28 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
>> Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: pcmk health check: UNHEALTHY
>> Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: Servant pcmk is outdated (age: 1246)
>> Feb 03 16:25:03 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
>> Feb 05 15:01:05 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
>> [root@memverge ~]#
>>
>> Oh... now it opened:
>>
>> Cluster Summary:
>>   * Stack: corosync (Pacemaker is running)
>>   * Current DC: memverge (27) (version 3.0.1-3.el10-b1a23a6) - MIXED-VERSION partition with quorum
>>   * Last updated: Thu Feb  5 15:14:45 2026
>>   * Last change:  Thu Feb  5 15:12:09 2026 by root via root on memverge
>>   * 2 nodes configured
>>   * 23 resource instances configured
>>
>> Node List:
>>   * Node memverge (27): online, feature set 3.20.1
>>   * Node memverge2 (28): online, feature set <3.15.1
>>
>> Full List of Resources:
>>   * Resource Group: g-nfs:
>>     * pb_nfs (ocf:heartbeat:portblock): Started memverge
>>     * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
>>     * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * nfsserver (ocf:heartbeat:nfsserver): Started memverge
>>     * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started memverge
>>     * samba_service (systemd:smb): Started memverge
>>     * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
>>     * punb_nfs (ocf:heartbeat:portblock): Started memverge
>>   * Resource Group: g-iscsi:
>>     * pb_iscsi (ocf:heartbeat:portblock): Started memverge
>>     * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>>     * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>>     * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
>>     * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
>>     * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
>>     * punb_iscsi (ocf:heartbeat:portblock): Started memverge
>>   * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
>>     * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
>>     * ha-nfs (ocf:linbit:drbd): Promoted memverge
>>   * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
>>     * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
>>     * ha-iscsi (ocf:linbit:drbd): Promoted memverge
>>   * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
>>   * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started memverge
>>   * watchdog-fencing (stonith:fence_watchdog): Starting memverge2
>>
>> Failed Resource Actions:
>>   * ipmi-fence-memverge_monitor_30000 on memverge2 'Error occurred' (1): call=93, status='Error', exitreason='Lost connection to fencer'
>>   * ipmi-fence-memveF
>>
>> And there are so many records in /var/log/messages:
>>
>> Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
>> Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
>> Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
>> [... the same message repeated ...]
>> [root@memverge ~]#
>>
>> I'm new to pacemaker/corosync, so it is quite complicated for me 😊
>>
>> Or maybe add fence_ipmilan as level 1 and don't add sbd as level 2, assuming the cluster should automatically detect it just because have-watchdog=true and fall back to sbd even without it being explicitly added as level 2?
>>
>> Not sure what we're seeing. The 'Fencer connection failed ...' messages would point to pacemaker-fenced having had a segfault or something. You might see traces of that elsewhere. And it would explain strange behavior of pacemaker in general if it is constantly trying to restart pacemaker-fenced.
>>
>> But sry again, I forgot to mention that the fence-resource has to be called 'watchdog', otherwise pacemaker won't align it with the already existing (if you have stonith-watchdog-timeout != 0) internal hidden device.
>>
>> If you don't do so, this is probably untested (I don't remember whether I tested that during development of the feature; it is definitely not a test case in CI or anything) and might lead to pacemaker-fenced having an issue. So this should probably be fixed, but if you use the correct naming it should work.
>>
>> Can you provide your cib & corosync-config so that we don't have to write back and forth that often?
>>
>> Regards,
>> Klaus
>>
>> Anton
>>
>> *From:* Klaus Wenninger <[email protected]>
>> *Sent:* Thursday, February 5, 2026 2:52 PM
>> *To:* Anton Gavriliuk <[email protected]>
>> *Cc:* Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>>
>> On Thu, Feb 5, 2026 at 12:56 PM Anton Gavriliuk <[email protected]> wrote:
>>
>> Correct, in addition to the two cluster nodes, there is a dedicated 3rd node, a physical server, as the qdevice.
>> I'm thinking about a two-level fencing topology: 1st level - fence_ipmilan, 2nd level - diskless sbd (hpwdt, /dev/watchdog).
>>
>> But I can't add sbd as a 2nd-level fencing device:
>>
>> [root@memverge2 ~]# pcs stonith level add 2 memverge watchdog
>> Error: Stonith resource(s) 'watchdog' do not exist, use --force to override
>> Error: Errors have occurred, therefore pcs is unable to continue
>> [root@memverge2 ~]#
>>
>> So back to the original question - what is the most correct way of implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
>>
>> Sorry then that I had overlooked qdevice (actually I thought I had checked for it, but ...).
>>
>> For adding the watchdog into a topology you have to make it visible first - just add it as you would any other fencing device, with fence_watchdog as the agent.
>>
>> There is a fence_watchdog script, but that is just for the meta-data. Pacemaker will recognize it and handle the actual fencing internally.
>>
>> Regards,
>> Klaus
>>
>> Anton
>>
>> -----Original Message-----
>> From: Andrei Borzenkov <[email protected]>
>> Sent: Thursday, February 5, 2026 1:17 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>> Cc: Anton Gavriliuk <[email protected]>
>> Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
>>
>> On Thu, Feb 5, 2026 at 2:07 PM Klaus Wenninger <[email protected]> wrote:
>> >
>> > On Wed, Feb 4, 2026 at 4:36 PM Anton Gavriliuk via Users <[email protected]> wrote:
>> >>
>> >> Hello
>> >>
>> >> There is a two-node (HPE DL345 Gen12 servers), shared-nothing, DRBD-based synchronous-replication (Protocol C), distributed active/standby Pacemaker storage metro-cluster. It is configured with qdevice, heuristics (parallel fping) and fencing - fence_ipmilan and diskless sbd (hpwdt, /dev/watchdog). All cluster resources are configured to always run together on the same node.
>> >>
>> >> The two storage cluster nodes and the qdevice are running on Rocky Linux 10.1.
>> >> Pacemaker version 3.0.1
>> >> Corosync version 3.1.9
>> >> DRBD version 9.3.0
>> >>
>> >> So, the question is – what is the most correct way of implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
>> >
>> > The correct way of using diskless sbd with a two-node cluster is not to use it ;-)
>> >
>> > Diskless sbd (watchdog-fencing) requires 'real' quorum, and quorum as provided by corosync in two-node mode would introduce split-brain. That is why sbd recognizes two-node operation and replaces quorum from corosync with the information that the peer node is currently in the cluster. This is fine for working with poison-pill fencing - a single shared disk then doesn't become a single point of failure as long as the peer is there. But for watchdog-fencing that doesn't help, because the peer going away would mean you have to commit suicide.
>> >
>> > An alternative with a two-node cluster is to step away from the actual two-node design and go with qdevice for 'real' quorum.
>>
>> Hmm ... the original description does mention qdevice, although it is not quite clear where it is located (is there a third node?)
>>
>> > You'll need some kind of 3rd node, but it doesn't have to be a full cluster node.
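For reference, the prerequisites discussed in the quoted part of the thread could be put together roughly like this with pcs - just a sketch, not taken from the actual configuration, and 'qnetd-host' is a placeholder for wherever corosync-qnetd runs:

    # get 'real' quorum via qdevice instead of two_node mode (needed for watchdog-fencing)
    pcs quorum device add model net host=qnetd-host algorithm=ffsplit

    # enable the internal watchdog-fencing device; the value has to fit the sbd watchdog timeout
    pcs property set stonith-watchdog-timeout=10s

    # make it visible as a resource - it has to be called 'watchdog' to be aligned with the internal device
    pcs stonith create watchdog fence_watchdog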
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
