On 10/08/18 10:51 +0200, Ferenc Wágner wrote:
> Failure story for amusement: the blades expose a BMC watchdog device to
> the OS, which was picked up by Corosync. It seemed like a useful second
> line of defense in case fencing (BMC IPMI power) failed for any reason;
> I let it live and forgot about it. Months later, after a firmware
> upgrade the BMC had to be restarted, and the watchdog device ioctl
> blocked Corosync for a minute or so. Of course membership fell apart.
> Actually, across the full cluster, because the BMC restarts were
> performed back-to-back (I authorized a single restart only, but anyway).
> I leave the rest to your imagination. Fencing (STONITH) worked (with
> delays) until quorum dissolved entirely... after a couple of minutes, it
> was over. We spent the rest of the day picking up the pieces, then the
> next few trying to reproduce the perceived Corosync network outage
> during BMC reboots without the cluster stack running. Of course in
> total vain. Half a year later an independent investigation of sporadic
> small Corosync delays revealed the watchdog connection, then we disabled
> the feature. Don't use (poorly implemented) BMC watchdogs.

Thanks for sharing these lessons learned; it is good to be reminded how far SPOF risks spread, even into seemingly improbable places. With a software-only watchdog, for instance, those risks are substantially more blatant, so while it cannot sensibly be recommended as a standalone configuration, using it as a backup watchdog may not be such a bad idea after all (loop over all configured watchdogs opened with the O_NONBLOCK flag?).

-- 
Nazdar,
Jan (Poki)
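
To make the parenthetical idea above concrete, here is a rough sketch of looping over several configured watchdog devices, each opened with O_NONBLOCK. It is not how Corosync handles watchdogs; the helper names `open_watchdogs` and `pet_all` are made up for illustration, and it assumes the drivers accept a plain write as a keepalive (the basic mode of the Linux watchdog interface). Whether a given driver, or a misbehaving BMC behind it, actually honors O_NONBLOCK is an open question, so treat this as an illustration of the loop, not as a guarantee against blocking.

```python
import os


def open_watchdogs(paths):
    """Open every configured watchdog device non-blocking.

    Devices that cannot be opened are skipped, so one broken watchdog
    does not prevent the remaining ones from being used.
    Caution: on real hardware, opening a watchdog device usually arms it.
    """
    fds = {}
    for path in paths:
        try:
            fds[path] = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError:
            continue  # missing or busy device: rely on the others
    return fds


def pet_all(fds):
    """Send a keepalive to each open watchdog; return the paths that took it."""
    petted = []
    for path, fd in fds.items():
        try:
            os.write(fd, b"\0")  # any write counts as a keepalive
            petted.append(path)
        except OSError:
            # Covers BlockingIOError (EAGAIN) as well: skip this device
            # for this round instead of stalling the caller.
            continue
    return petted
```

A caller would invoke `pet_all()` periodically from its main loop and could log any device missing from the returned list, which might have surfaced the BMC problem described above much earlier.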
