On Thu, Feb 5, 2026 at 3:07 PM Anton Gavriliuk <[email protected]> wrote:
> > - But sorry again, I forgot to mention that the fence-resource has to be
> > called 'watchdog', otherwise pacemaker won't align it with the already
> > existent (if you have stonith-watchdog-timeout != 0) internal hidden
> > device.
>
> [root@memverge ~]# pcs stonith create watchdog-fencing watchdog
> Error: Agent 'stonith:watchdog' is not installed or does not provide valid
> metadata: crm_resource: Metadata query for stonith:watchdog failed: No such
> device or address, Error performing operation: No such object, use --force
> to override
> Error: Errors have occurred, therefore pcs is unable to continue

The other way round: pcs stonith create watchdog fence_watchdog

> [root@memverge ~]#
>
> > - Can you provide your cib & corosync-config so that we don't have to
> > write back and forth that often?
>
> I attached them as files.
>
> Anton
>
> From: Klaus Wenninger <[email protected]>
> Sent: Thursday, February 5, 2026 3:42 PM
> To: Anton Gavriliuk <[email protected]>
> Cc: Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
>
> On Thu, Feb 5, 2026 at 2:21 PM Anton Gavriliuk <[email protected]> wrote:
>
> > I tried,
> >
> > [root@memverge ~]# pcs stonith create watchdog-fencing fence_watchdog
> >
> > But after that the running cluster is hanging... I can't run "crm_mon -Rr":
> > “error: Lost connection to controller”
> >
> > Perhaps this is because /dev/watchdog is already managed by pacemaker?
> >
> > [root@memverge ~]# systemctl status sbd
> > ● sbd.service - Shared-storage based fencing daemon
> >      Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset: disabled)
> >     Drop-In: /etc/systemd/system/sbd.service.d
> >              └─override.conf
> >      Active: active (running) since Tue 2026-02-03 16:09:00 EET; 1 day 22h ago
> >  Invocation: 11a9ba526ef5403682980d67a886a7b9
> >        Docs: man:sbd(8)
> >    Main PID: 2473 (sbd)
> >       Tasks: 3 (limit: 3355442)
> >      Memory: 18.8M (peak: 19.5M)
> >         CPU: 2min 22.568s
> >      CGroup: /system.slice/sbd.service
> >              ├─2473 "sbd: inquisitor"
> >              ├─2487 "sbd: watcher: Pacemaker"
> >              └─2488 "sbd: watcher: Cluster"
> >
> > Feb 03 16:09:00 memverge sbd[2473]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Feb 03 16:09:00 memverge sbd[2473]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Feb 03 16:09:00 memverge systemd[1]: Started sbd.service - Shared-storage based fencing daemon.
> > Feb 03 16:09:04 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Feb 03 16:11:27 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > Feb 03 16:11:28 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: pcmk health check: UNHEALTHY
> > Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: Servant pcmk is outdated (age: 1246)
> > Feb 03 16:25:03 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Feb 05 15:01:05 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
> > [root@memverge ~]#
> >
> > Oh..., now it opened:
> >
> > Cluster Summary:
> >   * Stack: corosync (Pacemaker is running)
> >   * Current DC: memverge (27) (version 3.0.1-3.el10-b1a23a6) - MIXED-VERSION partition with quorum
> >   * Last updated: Thu Feb  5 15:14:45 2026
> >   * Last change:  Thu Feb  5 15:12:09 2026 by root via root on memverge
> >   * 2 nodes configured
> >   * 23 resource instances configured
> >
> > Node List:
> >   * Node memverge (27): online, feature set 3.20.1
> >   * Node memverge2 (28): online, feature set <3.15.1
> >
> > Full List of Resources:
> >   * Resource Group: g-nfs:
> >     * pb_nfs (ocf:heartbeat:portblock): Started memverge
> >     * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
> >     * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * nfsserver (ocf:heartbeat:nfsserver): Started memverge
> >     * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started memverge
> >     * samba_service (systemd:smb): Started memverge
> >     * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
> >     * punb_nfs (ocf:heartbeat:portblock): Started memverge
> >   * Resource Group: g-iscsi:
> >     * pb_iscsi (ocf:heartbeat:portblock): Started memverge
> >     * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
> >     * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
> >     * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
> >     * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
> >     * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
> >     * punb_iscsi (ocf:heartbeat:portblock): Started memverge
> >   * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
> >     * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
> >     * ha-nfs (ocf:linbit:drbd): Promoted memverge
> >   * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
> >     * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
> >     * ha-iscsi (ocf:linbit:drbd): Promoted memverge
> >   * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
> >   * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started memverge
> >   * watchdog-fencing (stonith:fence_watchdog): Starting memverge2
> >
> > Failed Resource Actions:
> >   * ipmi-fence-memverge_monitor_30000 on memverge2 'Error occurred' (1): call=93, status='Error', exitreason='Lost connection to fencer'
> >   * ipmi-fence-memveF
> >
> > And there are so many records in /var/log/messages,
> >
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > Feb  5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
> > [root@memverge ~]#
> >
> > I'm new to pacemaker/corosync, so it is quite complicated to me 😊
> >
> > Or maybe add fence_ipmilan as level 1 and don't add sbd as level 2,
> > assuming the cluster should automatically detect it just because
> > have-watchdog=true and fall back to sbd even without it being explicitly
> > configured as level 2?
>
> Not sure what we're seeing. The 'Fencer connection failed ...' thing would
> point to pacemaker-fenced having had a segfault or something.
> You might see traces of that elsewhere. And it would explain strange
> behavior of pacemaker in general if it is constantly trying to
> restart pacemaker-fenced.
>
> But sorry again, I forgot to mention that the fence-resource has to be
> called 'watchdog', otherwise pacemaker won't align it with the already
> existent (if you have stonith-watchdog-timeout != 0) internal hidden
> device.
> If you don't do so, this is probably untested (Don't remember if I had
> tested that during development of the feature. It is definitely not a
> test-case for CI or something.) and might lead to pacemaker-fenced having
> an issue. So this should probably be fixed, but if you use the correct
> naming it should work.
>
> Can you provide your cib & corosync-config so that we don't have to write
> back and forth that often?
>
> Regards,
> Klaus
>
> > Anton
> >
> > From: Klaus Wenninger <[email protected]>
> > Sent: Thursday, February 5, 2026 2:52 PM
> > To: Anton Gavriliuk <[email protected]>
> > Cc: Andrei Borzenkov <[email protected]>; Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> > Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
> >
> > On Thu, Feb 5, 2026 at 12:56 PM Anton Gavriliuk <[email protected]> wrote:
> >
> > > Correct, in addition to the two cluster nodes there is a dedicated 3rd
> > > node, a physical server, as qdevice.
> > >
> > > I'm thinking about a two-level fencing topology, 1st level -
> > > fence_ipmilan, 2nd - diskless sbd (hpwdt, /dev/watchdog).
> > >
> > > But I can't add sbd as 2nd-level fencing:
> > >
> > > [root@memverge2 ~]# pcs stonith level add 2 memverge watchdog
> > > Error: Stonith resource(s) 'watchdog' do not exist, use --force to override
> > > Error: Errors have occurred, therefore pcs is unable to continue
> > > [root@memverge2 ~]#
> > >
> > > So back to the original question - what is the most correct way of
> > > implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt,
> > > /dev/watchdog)?
> >
> > Sorry then that I had overlooked qdevice (actually I thought I checked
> > for it but ...).
> > For adding the watchdog into a topology you have to make it visible
> > beforehand - just add it as any fencing-device with fence_watchdog as
> > agent.
> > There is a fence_watchdog script, but that is just for the meta-data.
> > Pacemaker will recognize that and handle the actual fencing internally.
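Putting the pieces of this thread together, a minimal sketch of the sequence described above could look roughly like the following. The node and device names are the ones already used in the thread; the stonith-watchdog-timeout value is only an example and has to fit the watchdog timeout sbd is configured with on the nodes:

    # The watchdog fencing device must be named 'watchdog' so pacemaker can
    # align it with the internal hidden device it creates for watchdog fencing.
    pcs stonith create watchdog fence_watchdog

    # Enable watchdog fencing (example value; must be larger than
    # SBD_WATCHDOG_TIMEOUT used by sbd).
    pcs property set stonith-watchdog-timeout=10

    # Topology: try IPMI first (level 1), fall back to watchdog
    # self-fencing (level 2).
    pcs stonith level add 1 memverge ipmi-fence-memverge
    pcs stonith level add 1 memverge2 ipmi-fence-memverge2
    pcs stonith level add 2 memverge watchdog
    pcs stonith level add 2 memverge2 watchdog

    # Check the resulting topology.
    pcs stonith level config

With the device named 'watchdog', pacemaker maps it onto the internal device that exists whenever stonith-watchdog-timeout is non-zero, which is the alignment described above.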
> > Regards,
> > Klaus
> >
> > > Anton
> > >
> > > -----Original Message-----
> > > From: Andrei Borzenkov <[email protected]>
> > > Sent: Thursday, February 5, 2026 1:17 PM
> > > To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
> > > Cc: Anton Gavriliuk <[email protected]>
> > > Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
> > >
> > > On Thu, Feb 5, 2026 at 2:07 PM Klaus Wenninger <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 4, 2026 at 4:36 PM Anton Gavriliuk via Users <[email protected]> wrote:
> > > > >
> > > > > Hello
> > > > >
> > > > > There is a two-node (HPE DL345 Gen12 servers), shared-nothing, DRBD-based
> > > > > synchronous-replication (Protocol C), distributed active/standby pacemaker
> > > > > storage metro-cluster. The metro-cluster is configured with qdevice,
> > > > > heuristics (parallel fping) and fencing - fence_ipmilan and diskless sbd
> > > > > (hpwdt, /dev/watchdog). All cluster resources are configured to always run
> > > > > together on the same node.
> > > > >
> > > > > The two storage cluster nodes and qdevice are running on Rocky Linux 10.1
> > > > > Pacemaker version 3.0.1
> > > > > Corosync version 3.1.9
> > > > > DRBD version 9.3.0
> > > > >
> > > > > So, the question is – what is the most correct way of implementing
> > > > > STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
> > > >
> > > > The correct way of using diskless sbd with a two-node cluster is not to
> > > > use it ;-)
> > > >
> > > > diskless sbd (watchdog-fencing) requires 'real' quorum, and quorum provided
> > > > by corosync in two-node mode would introduce split-brain, which is the
> > > > reason why sbd recognizes two-node operation and replaces quorum from
> > > > corosync with the information that the peer node is currently in the
> > > > cluster. This is fine for working with poison-pill fencing - a single
> > > > shared disk then doesn't become a single point of failure as long as the
> > > > peer is there. But for watchdog-fencing that doesn't help, because the
> > > > peer going away would mean you have to commit suicide.
> > > >
> > > > An alternative with a two-node cluster is to step away from the actual
> > > > two-node design and go with qdevice for 'real' quorum.
> > >
> > > Hmm ... the original description does mention qdevice, although it is not
> > > quite clear where it is located (is there a third node?)
> > >
> > > > You'll need some kind of 3rd node but it doesn't have to be a full
> > > > cluster node.
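Since the whole discussion hinges on the cluster having 'real' quorum for diskless sbd, a quick sanity check along these lines may help. This is only a sketch, assuming the Rocky/RHEL default file locations used elsewhere in this thread:

    # two_node must not be in use once qdevice provides the tie-breaking vote
    grep -E 'two_node|device' /etc/corosync/corosync.conf

    # qdevice should show up as an additional vote and the cluster as quorate
    corosync-quorumtool -s
    pcs quorum status

    # diskless sbd: a watchdog device is configured and there is no SBD_DEVICE line
    grep -E 'SBD_DEVICE|SBD_WATCHDOG' /etc/sysconfig/sbd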
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
