> So a monitor failure on the fence agent rendered the cluster effectively unresponsive? How would I normally recover from this?

Actually, it will ban the stonith resource from a node once the fail count on that node reaches the maximum. Once the stonith resource is banned from all nodes, no node will be able to use it.
You can use the 'failure-timeout' meta attribute to reset the fail count automatically. I'm using it for my IPMI fencing devices. Of course, the best approach is to make the stonith device more reliable, but usually that is out of our control. Another approach is to define a second stonith method and use a stonith topology.

Best Regards,
Strahil Nikolov
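P.S. A rough sketch of what the two suggestions above could look like on the command line. The device names (fence_ipmi_node1, fence_backup_node1), the node name (node1) and the 120s value are placeholders I made up for illustration, and the exact pcs syntax may differ between versions, so please check it against your own installation before using it.

```shell
# Hypothetical names: fence_ipmi_node1, fence_backup_node1, node1.

# Show the current fail counts in a one-shot status report.
crm_mon --one-shot --failcounts

# Let recorded failures of the IPMI fence device expire after 120s
# (example value only; pick one that suits your environment).
crm_resource --resource fence_ipmi_node1 --meta \
    --set-parameter failure-timeout --parameter-value 120s

# Clear the recorded failures right away, which lifts the ban.
crm_resource --cleanup --resource fence_ipmi_node1

# Stonith topology: try IPMI first (level 1), then fall back to
# a second, independent device (level 2) for the same target node.
pcs stonith level add 1 node1 fence_ipmi_node1
pcs stonith level add 2 node1 fence_backup_node1
```

The cleanup command is the manual way to recover once the device is already banned, while failure-timeout lets the fail count expire on its own, provided no new failures occur in the meantime.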
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
