> Note that handling of clones is done on a different level, i.e.
> by the CRM, which decides where to run resources. The idea of
> cloned stonith resources was to have "more" assurance that one
> of the nodes running the stonith resource can shoot the
> offending node. Obviously, this makes sense only for clusters
> with more than two nodes. On the other hand, if your stonith
> devices are reliable and regularly monitored, I don't see any
> need for shooting a node from more than one node. So, with
> lights-out devices which can manage only their own host (iLO,
> IBM RSA, DRAC), I'd suggest having a normal (non-cloned) stonith
> resource with a -INF constraint to prevent it from running on
> the node it can shoot. This kind of power management setup
> seems to be very popular and probably prevails today.
>
> On larger clusters with stonith devices which can shoot a set
> of nodes, a single cloned resource should suffice.
>
> Does this help? A bit at least?
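Dejan,

To make sure I follow, the non-cloned setup you describe would look
roughly like this, correct? (A sketch only; external/ipmi and all of
its parameters below are placeholders, not from my actual config.)

# crm configure
primitive st_lb02 stonith:external/ipmi \
        params hostname="wwwlb02.microcenter.com" \
               ipaddr="192.168.1.112" userid="admin" passwd="secret" \
        op monitor interval="60s"
location st_lb02_not_on_lb02 st_lb02 -inf: wwwlb02.microcenter.com

That way the resource that shoots wwwlb02 can only ever run on
wwwlb01, which I believe is the point of the -INF constraint.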
This does help me understand that a cloned stonith resource is
probably not necessary in a simple two-node cluster. I will back up
my config today, try a non-cloned resource to see what the behavior
is, and report back to the list.

What I am really trying to ensure is that, if the STONITH resource
fails to start or monitor, the cluster keeps retrying. I want to
avoid the situation where a node is online again after a brief
network outage and capable of running resources, but not able to
shoot its partner. I wasn't sure whether this was actually a bug or
a configuration/operational/understanding issue on my part.

To add more information: following some of Tak's comments, I took a
look at the fail count for the resource, and it was at INFINITY
after the test failures:

# crm_failcount -G -r cl_stonith_lb02:0
name=fail-count-cl_stonith_lb02:0 value=INFINITY

I then cleared the fail count and made sure it took:

# crm_failcount -D -r cl_stonith_lb02:0
# crm_failcount -G -r cl_stonith_lb02:0
name=fail-count-cl_stonith_lb02:0 value=0

And did a cleanup for both nodes:

# crm_resource -C -r cl_stonith_lb02:0 -H wwwlb01.microcenter.com
# crm_resource -C -r cl_stonith_lb02:0 -H wwwlb02.microcenter.com

The stonith resource did restart and appears to be back to normal.
Is this the correct process to follow in the future, can the retry
behavior after a failure be adjusted in the CIB, or is this a bug?

Thanks for all the help,

-ab
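P.S. Partially answering my own question after some more reading: the
retry behavior does look tunable in the CIB. If I understand the docs
correctly, start failures jump straight to INFINITY because the
start-failure-is-fatal cluster property defaults to true, and setting
a failure-timeout meta attribute should let old failures expire so
the cluster retries on its own. Something like this, perhaps
(untested, and the 10min value is an arbitrary guess on my part):

# crm configure property start-failure-is-fatal="false"
# crm resource meta cl_stonith_lb02 set failure-timeout 10min

Does that match your understanding?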
