Re: [ClusterLabs] digression: Corosync watchdog experience

Jan Pokorný Fri, 10 Aug 2018 05:26:41 -0700

On 10/08/18 10:51 +0200, Ferenc Wágner wrote:
> Failure story for amusement: the blades expose a BMC watchdog device to
> the OS, which was picked up by Corosync.  It seemed like a useful second
> line of defense in case fencing (BMC IPMI power) failed for any reason;
> I let it live and forgot about it.  Months later, after a firmware
> upgrade the BMC had to be restarted, and the watchdog device ioctl
> blocked Corosync for a minute or so.  Of course membership fell apart.
> Actually, across the full cluster, because the BMC restarts were
> performed back-to-back (I authorized a single restart only, but anyway).
> I leave the rest to your imagination.  Fencing (STONITH) worked (with
> delays) until quorum dissolved entirely... after a couple of minutes, it
> was over.  We spent the rest of the day picking up the pieces, then the
> next few trying to reproduce the perceived Corosync network outage
> during BMC reboots without the cluster stack running.  Of course,
> entirely in vain.  Half a year later an independent investigation of sporadic
> small Corosync delays revealed the watchdog connection, then we disabled
> the feature.  Don't use (poorly implemented) BMC watchdogs.


Thanks for sharing these lessons learned; it is good to be reminded how
far the SPOF risks spread, even into seemingly improbable places.  With
a software-only watchdog, for instance, those risks are substantially
more blatant, so while it cannot sensibly be recommended as the sole
watchdog, using it as a backup watchdog may not be such a bad idea
after all (loop over all configured watchdogs, each opened with the
O_NONBLOCK flag?).
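
To make the parenthetical idea a bit more concrete, here is a minimal
sketch in Python, not anything Corosync actually does.  The function
names open_watchdogs and pet_watchdogs are hypothetical.  It also
assumes that the driver honors O_NONBLOCK for writes, which is far from
guaranteed; a poorly implemented BMC driver may block regardless, and
the ioctl path mentioned above is not covered here at all.

```python
import os


def open_watchdogs(paths):
    """Open every configured watchdog device with O_NONBLOCK.

    Devices that cannot be opened are skipped, so one broken or missing
    watchdog does not prevent the remaining ones from being used.
    Caution: opening a real watchdog device usually arms it.
    """
    opened = []
    for path in paths:
        try:
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError:
            continue  # missing, busy, or not permitted: try the next one
        opened.append((path, fd))
    return opened


def pet_watchdogs(opened):
    """Send a keepalive write to each open watchdog.

    Returns the paths whose write failed (including EAGAIN, i.e. the
    device would have blocked), so the caller can log or drop them
    instead of stalling its main loop.
    """
    failed = []
    for path, fd in opened:
        try:
            os.write(fd, b"\0")
        except OSError:  # BlockingIOError is a subclass of OSError
            failed.append(path)
    return failed
```

A caller would pass its list of configured device paths once at startup
and call pet_watchdogs() from its periodic tick, treating any returned
path as degraded rather than waiting on it.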

-- 
Nazdar,
Jan (Poki)


