Hi, I'm testing a two-node cluster using SBD 1.4.0, Corosync 2.4.2, and Pacemaker 1.1.16. For testing, I have one shared block storage device between the two nodes, and each node has an "IPMI watchdog" device available at '/dev/watchdog'.
Up until today, I've been testing the SBD fencing functionality with "stonith-action" set to its default value of "reboot", and that has worked well. Then I wanted to test with "stonith-action" set to "off", so that fencing powers the node off instead of rebooting it. I set this, used stonith_admin to fence one of the nodes, and was surprised that it did not turn off. The SBD daemon logs show:

--snip--
Mar 14 19:38:02 testnode-2 sbd[2114]: /dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504: notice: servant: Received command off from testnode-1 on disk /dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504
Mar 14 19:38:02 testnode-2 sbd[2108]: warning: inquisitor_child: /dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504 requested a shutoff
Mar 14 19:38:02 testnode-2 sbd[2108]: emerg: do_exit: Rebooting system: off
Mar 14 19:38:02 testnode-2 sbd[2108]: info: sysrq_trigger: sysrq-trigger: o
Mar 14 19:38:12 testnode-2 sbd[2116]: pcmk: error: crm_ipc_read: Connection to cib_ro failed
Mar 14 19:38:12 testnode-2 sbd[2116]: pcmk: error: mainloop_gio_callback: Connection to cib_ro[0x1272fa0] closed (I/O condition=1)
Mar 14 19:38:12 testnode-2 sbd[2116]: pcmk: warning: set_servant_health: Disconnected from CIB
Mar 14 19:38:13 testnode-2 sbd[2116]: pcmk: warning: mon_timer_reconnect: CIB reconnect failed: -107
Mar 14 19:38:14 testnode-2 sbd[2116]: pcmk: warning: set_servant_health: Node state: pending
Mar 14 19:38:14 testnode-2 sbd[2116]: pcmk: warning: mon_timer_reconnect: CIB reconnect failed: -107
--snip--

So sbd wrote 'o' into '/proc/sysrq-trigger', but the power-off never completed. The kernel logs show the attempt:

[ 1902.935589] sysrq: SysRq : Power Off

It turns out there is a bug in another driver on this system: a task hangs during the power-off path, which is why sysrq 'o' doesn't work correctly here. That's a separate issue for me to work through.

My question is: Why didn't the watchdog device reboot the system?
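(Side note while debugging the sysrq path: I wrote a tiny helper to sanity-check the value in /proc/sys/kernel/sysrq against Documentation/admin-guide/sysrq.rst, since a restrictive bitmask would also stop 'o' from working. The function name is mine, just for illustration.)

```python
def sysrq_allows_poweroff(mask: int) -> bool:
    """Check whether a /proc/sys/kernel/sysrq value permits SysRq 'o'.

    Per the kernel docs: 0 disables SysRq entirely, 1 enables every
    command, and any other value is a bitmask where bit 0x80 enables
    reboot/poweroff.
    """
    if mask == 0:
        return False
    if mask == 1:
        return True
    return bool(mask & 0x80)


# Example: read the live value and report (read-only, safe to run):
# with open("/proc/sys/kernel/sysrq") as f:
#     print(sysrq_allows_poweroff(int(f.read())))
```

(In my case the kernel did accept the command, per the "SysRq : Power Off" log line, so the mask wasn't the problem, but it's a quick thing to rule out.)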
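From what I've read so far about the Linux watchdog API, closing /dev/watchdog without first writing the magic character 'V' should leave the timer armed, so a later expiry would still reset the node. Here's a toy Python model of that "magic close" rule as I currently understand it (purely illustrative, not real hardware, and the class name is mine):

```python
class WatchdogSim:
    """Toy model of the Linux watchdog 'magic close' semantics
    (Documentation/watchdog/watchdog-api.rst). Not a real driver;
    it only illustrates when the timer is expected to stay armed.
    """

    def __init__(self):
        self.armed = False
        self._expect_close = False

    def open(self):
        # Opening /dev/watchdog arms the timer immediately.
        self.armed = True
        self._expect_close = False

    def write(self, data: bytes):
        # Any write pets the watchdog; a write containing 'V'
        # signals that the next close should stop the timer.
        self._expect_close = b"V" in data

    def close(self):
        if self._expect_close:
            # Clean "magic close": timer is stopped.
            self.armed = False
        # Otherwise the driver logs something like
        # "Unexpected close, not stopping watchdog!" and the timer
        # stays armed, resetting the machine when it expires.
```

If this model is right, then sbd closing the device without 'V' should have left the IPMI watchdog running, and the node should eventually reset once the timeout expires.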
Since the "off" operation didn't work, I was expecting sbd to stop poking the watchdog, and the watchdog to then reset the node. Instead, I saw this message when the IPMI watchdog device was closed (by sbd, I assume):

[ 1902.933851] IPMI Watchdog: Unexpected close, not stopping watchdog!

I'm still reading up on watchdog devices, but I'm looking for guidance to focus my search: Should the node have been reset via the watchdog device? Is that the expected behavior, or is the watchdog not expected to reset the system in this scenario?

(Note: I confirmed the watchdog device does work by using 'sbd test-watchdog'.)

Any help or tips would be greatly appreciated.

Thanks,
Marc
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
