On Thu, Feb 5, 2026 at 3:07 PM Anton Gavriliuk <[email protected]> wrote:
> > - But sorry again, I forgot to mention that the fence-resource has to be
> > called 'watchdog', otherwise pacemaker won't align it with the already
> > existent (if you have stonith-watchdog-timeout != 0) internal hidden
> > device.
>
> [root@memverge ~]# pcs stonith create watchdog-fencing watchdog
> Error: Agent 'stonith:watchdog' is not installed or does not provide valid
> metadata: crm_resource: Metadata query for stonith:watchdog failed: No such
> device or address, Error performing operation: No such object, use --force
> to override
> Error: Errors have occurred, therefore pcs is unable to continue

The other way round: pcs stonith create watchdog fence_watchdog

> [root@memverge ~]#
>
> > - Can you provide your cib & corosync-config so that we don't have to
> > write back and forth that often?
>
> I attached them as files.
>
> Anton
>
> From: Klaus Wenninger <[email protected]>
> Sent: Thursday, February 5, 2026 3:42 PM
> To: Anton Gavriliuk <[email protected]>
> Cc: Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
>
> On Thu, Feb 5, 2026 at 2:21 PM Anton Gavriliuk <[email protected]> wrote:
>
> > I tried,
> >
> > [root@memverge ~]# pcs stonith create watchdog-fencing fence_watchdog
> >
> > But after that the running cluster is hanging... I can't run "crm_mon -Rr":
> > “error: Lost connection to controller”
> >
> > Perhaps this is because /dev/watchdog is already managed by pacemaker?
> >
> > [root@memverge ~]# systemctl status sbd
> > ● sbd.service - Shared-storage based fencing daemon
> >      Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset: disabled)
> >     Drop-In: /etc/systemd/system/sbd.service.d
> >              └─override.conf
> >      Active: active (running) since Tue 2026-02-03 16:09:00 EET; 1 day 22h ago
> >  Invocation: 11a9ba526ef5403682980d67a886a7b9
> >        Docs: man:sbd(8)
> >    Main PID: 2473 (sbd)
> >       Tasks: 3 (limit: 3355442)
> >      Memory: 18.8M (peak: 19.5M)
> >         CPU: 2min 22.568s
> >      CGroup: /system.slice/sbd.service
> >              ├─2473 "sbd: inquisitor"
> >              ├─2487 "sbd: watcher: Pacemaker"
> >              └─2488 "sbd: watcher: Cluster"
> >
> > Feb 03 16:09:00 memverge sbd[2473]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Feb 03 16:09:00 memverge sbd[2473]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Feb 03 16:09:00 memverge systemd[1]: Started sbd.service - Shared-storage based fencing daemon.
> > Feb 03 16:09:04 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Feb 03 16:11:27 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > Feb 03 16:11:28 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> > Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: Servant pcmk is outdated (age: 1246)
> > Feb 03 16:25:03 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Feb 05 15:01:05 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > [root@memverge ~]#
> >
> > Oh..., now it opened:
> >
> > Cluster Summary:
> >   * Stack: corosync (Pacemaker is running)
> >   * Current DC: memverge (27) (version 3.0.1-3.el10-b1a23a6) - MIXED-VERSION partition with quorum
> >   * Last updated: Thu Feb  5 15:14:45 2026
> >   * Last change:  Thu Feb  5 15:12:09 2026 by root via root on memverge
> >   * 2 nodes configured
> >   * 23 resource instances configured
> >
> > Node List:
> >   * Node memverge (27): online, feature set 3.20.1
> >   * Node memverge2 (28): online, feature set <3.15.1
> >
> > Full List of Resources:
> >   * Resource Group: g-nfs:
> >     * pb_nfs (ocf:heartbeat:portblock): Started memverge
> >     * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
> >     * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * nfsserver (ocf:heartbeat:nfsserver): Started memverge
> >     * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started memverge
> >     * samba_service (systemd:smb): Started memverge
> >     * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * punb_nfs (ocf:heartbeat:portblock): Started memverge
> >   * Resource Group: g-iscsi:
> >     * pb_iscsi (ocf:heartbeat:portblock): Started memverge
> >     * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
> >     * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
> >     * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
> >     * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
> >     * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
> >     * punb_iscsi (ocf:heartbeat:portblock): Started memverge
> >   * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
> >     * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
> >     * ha-nfs (ocf:linbit:drbd): Promoted memverge
> >   * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
> >     * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
> >     * ha-iscsi (ocf:linbit:drbd): Promoted memverge
> >   * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
> >   * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started memverge
> >   * watchdog-fencing (stonith:fence_watchdog): Starting memverge2
> >
> > Failed Resource Actions:
> >   * ipmi-fence-memverge_monitor_30000 on memverge2 'Error occurred' (1): call=93, status='Error', exitreason='Lost connection to fencer'
> >   * ipmi-fence-memveF
> >
> > And there are so many records in /var/log/messages,
> >
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > [root@memverge ~]#
> >
> > I'm new to pacemaker/corosync, so it is quite complicated to me 😊
> >
> > Or maybe add fence_ipmilan as level 1 and don't add sbd as level 2,
> > assuming the cluster should automatically detect it just because
> > have-watchdog=true and fall back to sbd even without it being explicitly
> > configured as level 2?
>
> Not sure what we're seeing. The 'Fencer connection failed ...' thing would
> point to pacemaker-fenced having had a segfault or something.
> You might see traces of that elsewhere. And it would explain strange
> behavior of pacemaker in general if it is constantly trying to
> restart pacemaker-fenced.
>
> But sorry again, I forgot to mention that the fence-resource has to be
> called 'watchdog', otherwise pacemaker won't align it with the already
> existent (if you have stonith-watchdog-timeout != 0) internal hidden
> device.
> If you don't do so, this is probably untested (Don't remember if I had
> tested that during development of the feature. It is definitely not a
> test-case for CI or something.) and might lead to pacemaker-fenced having
> an issue. So this should probably be fixed, but if you use the correct
> naming it should work.
>
> Can you provide your cib & corosync-config so that we don't have to write
> back and forth that often?
>
> Regards,
> Klaus
>
> > Anton
> >
> > From: Klaus Wenninger <[email protected]>
> > Sent: Thursday, February 5, 2026 2:52 PM
> > To: Anton Gavriliuk <[email protected]>
> > Cc: Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> > Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
> >
> > On Thu, Feb 5, 2026 at 12:56 PM Anton Gavriliuk <[email protected]> wrote:
> >
> > > Correct, in addition to the two cluster nodes there is a dedicated 3rd
> > > node, a physical server, as qdevice.
> > >
> > > I'm thinking about a two-level fencing topology, 1st level -
> > > fence_ipmilan, 2nd - diskless sbd (hpwdt, /dev/watchdog).
> > >
> > > But I can't add sbd as 2nd-level fencing:
> > >
> > > [root@memverge2 ~]# pcs stonith level add 2 memverge watchdog
> > > Error: Stonith resource(s) 'watchdog' do not exist, use --force to override
> > > Error: Errors have occurred, therefore pcs is unable to continue
> > > [root@memverge2 ~]#
> > >
> > > So back to the original question - what is the most correct way of
> > > implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt,
> > > /dev/watchdog)?
> >
> > Sorry then that I had overlooked qdevice (actually I thought I checked
> > for it but ...).
> > For adding the watchdog into a topology you have to make it visible
> > beforehand - just add it as any fencing-device with fence_watchdog as
> > agent.
> > There is a fence_watchdog script, but that is just for the meta-data.
> > Pacemaker will recognize that and handle the actual fencing internally.
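Putting the pieces of this thread together, a minimal sketch of the sequence described above could look roughly like the following. The node and device names are the ones already used in the thread; the stonith-watchdog-timeout value is only an example and has to fit the watchdog timeout sbd is configured with on the nodes:

    # The watchdog fencing device must be named 'watchdog' so pacemaker can
    # align it with the internal hidden device it creates for watchdog fencing.
    pcs stonith create watchdog fence_watchdog

    # Enable watchdog fencing (example value; must be larger than
    # SBD_WATCHDOG_TIMEOUT used by sbd).
    pcs property set stonith-watchdog-timeout=10

    # Topology: try IPMI first (level 1), fall back to watchdog
    # self-fencing (level 2).
    pcs stonith level add 1 memverge ipmi-fence-memverge
    pcs stonith level add 1 memverge2 ipmi-fence-memverge2
    pcs stonith level add 2 memverge watchdog
    pcs stonith level add 2 memverge2 watchdog

    # Check the resulting topology.
    pcs stonith level config

With the device named 'watchdog', pacemaker maps it onto the internal device that exists whenever stonith-watchdog-timeout is non-zero, which is the alignment described above.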
> > Regards,
> > Klaus
> >
> > > Anton
> > >
> > > -----Original Message-----
> > > From: Andrei Borzenkov <[email protected]>
> > > Sent: Thursday, February 5, 2026 1:17 PM
> > > To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> > > Cc: Anton Gavriliuk <[email protected]>
> > > Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
> > >
> > > On Thu, Feb 5, 2026 at 2:07 PM Klaus Wenninger <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 4, 2026 at 4:36 PM Anton Gavriliuk via Users <[email protected]> wrote:
> > > > >
> > > > > Hello
> > > > >
> > > > > There is a two-node (HPE DL345 Gen12 servers), shared-nothing, DRBD-based
> > > > > synchronous-replication (Protocol C), distributed active/standby pacemaker
> > > > > storage metro-cluster. The metro-cluster is configured with qdevice,
> > > > > heuristics (parallel fping) and fencing - fence_ipmilan and diskless sbd
> > > > > (hpwdt, /dev/watchdog). All cluster resources are configured to always run
> > > > > together on the same node.
> > > > >
> > > > > The two storage cluster nodes and qdevice are running on Rocky Linux 10.1
> > > > > Pacemaker version 3.0.1
> > > > > Corosync version 3.1.9
> > > > > DRBD version 9.3.0
> > > > >
> > > > > So, the question is – what is the most correct way of implementing
> > > > > STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
> > > >
> > > > The correct way of using diskless sbd with a two-node cluster is not to
> > > > use it ;-)
> > > >
> > > > diskless sbd (watchdog-fencing) requires 'real' quorum, and quorum provided
> > > > by corosync in two-node mode would introduce split-brain, which is the
> > > > reason why sbd recognizes two-node operation and replaces quorum from
> > > > corosync with the information that the peer node is currently in the
> > > > cluster. This is fine for working with poison-pill fencing - a single
> > > > shared disk then doesn't become a single point of failure as long as the
> > > > peer is there. But for watchdog-fencing that doesn't help, because the
> > > > peer going away would mean you have to commit suicide.
> > > >
> > > > An alternative with a two-node cluster is to step away from the actual
> > > > two-node design and go with qdevice for 'real' quorum.
> > >
> > > Hmm ... the original description does mention qdevice, although it is not
> > > quite clear where it is located (is there a third node?)
> > >
> > > > You'll need some kind of 3rd node but it doesn't have to be a full
> > > > cluster node.
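Since the whole discussion hinges on the cluster having 'real' quorum for diskless sbd, a quick sanity check along these lines may help. This is only a sketch, assuming the Rocky/RHEL default file locations used elsewhere in this thread:

    # two_node must not be in use once qdevice provides the tie-breaking vote
    grep -E 'two_node|device' /etc/corosync/corosync.conf

    # qdevice should show up as an additional vote and the cluster as quorate
    corosync-quorumtool -s
    pcs quorum status

    # diskless sbd: a watchdog device is configured and there is no SBD_DEVICE line
    grep -E 'SBD_DEVICE|SBD_WATCHDOG' /etc/sysconfig/sbd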
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
