On Wed, Nov 05, 2008 at 09:03:48AM -0500, Aaron Bush wrote:
> > Note that handling of clones is done on a different level, i.e.
> > by the CRM, which decides where to run resources. The idea of
> > cloned stonith resources was to have "more" assurance that one of
> > the nodes which run the stonith resource can shoot the offending
> > node. Obviously, this may make sense only for clusters with more
> > than two nodes. On the other hand, if your stonith devices are
> > reliable and regularly monitored, I don't see any need for
> > shooting a node from more than one node. So, with the lights-out
> > devices which are capable of managing only their own host (iLO,
> > IBM RSA, DRAC) I'd suggest having a normal (non-cloned) stonith
> > resource with a -INF constraint to prevent it from running on the
> > node it can shoot. This kind of power management setup seems to
> > be very popular and probably prevails today.
> >
> > On larger clusters with stonith devices which can shoot a set of
> > nodes, a single cloned resource should suffice.
> >
> > Does this help? A bit at least?
>
> Dejan,
>
> This does help me understand that a cloned stonith resource in a
> simple two-node cluster is probably not necessary. I will back up my
> config today, try a non-cloned resource to see what the behavior is,
> and report back to the list.
>
> What I am really just trying to ensure is that on a failure of the
> STONITH resource to start/monitor, it will keep retrying to
> start/monitor. I want to avoid the situation where we have a node
> that is online again after a brief network outage and is capable of
> running resources but is not able to shoot its partner. I wasn't
> sure if this was actually a bug or more a
> configuration/operational/understanding issue on my part.
>
> To add more information on the issue, I did follow some of Tak's
> comments and took a look at the fail-count for the resource; it is
> at INFINITY (following the test failures):
>
> # crm_failcount -G -r cl_stonith_lb02:0
> name=fail-count-cl_stonith_lb02:0 value=INFINITY
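(For reference, the non-cloned setup described above might look roughly
like the following in crm shell syntax, or the equivalent CIB XML. The
external/ipmi plugin, its parameters, and the resource and constraint
ids here are only placeholders; substitute whatever matches the actual
lights-out device. The idea is that the resource which can shoot
wwwlb02 is kept off wwwlb02 by the -INF location constraint:

  primitive stonith_lb02 stonith:external/ipmi \
          params hostname="wwwlb02.microcenter.com" \
                 ipaddr="<power device address>" \
                 userid="<user>" passwd="<password>" \
          op monitor interval="60s" timeout="60s"
  location stonith_lb02_never_on_lb02 stonith_lb02 \
          -inf: wwwlb02.microcenter.com
)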
Yes, a fail-count of INFINITY would probably prevent the resource from
starting on that node.

> I then cleared the failcount and made sure it took...
>
> # crm_failcount -D -r cl_stonith_lb02:0
> # crm_failcount -G -r cl_stonith_lb02:0
> name=fail-count-cl_stonith_lb02:0 value=0
>
> And did a cleanup for both nodes:
>
> # crm_resource -C -r cl_stonith_lb02:0 -H wwwlb01.microcenter.com
> # crm_resource -C -r cl_stonith_lb02:0 -H wwwlb02.microcenter.com
>
> The stonith resource did restart and appears to be back to normal.
> Just wondering if this is the correct process to follow in the
> future, or whether the failure retry interval can be adjusted in the
> CIB, or whether this is a bug?

Can't say without looking at the logs. I think that you should start a
new thread with these concerns. There are also quite a few discussions
on the topic in the list archives. Note that in this respect stonith
resources are treated like any other.

Thanks,

Dejan

> Thanks for all the help,
> -ab
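(As for adjusting the retry behaviour in the CIB: depending on the
Pacemaker version in use, the fail-count can be made to expire on its
own by setting the failure-timeout meta attribute on the resource,
after which the cluster will try to start it again. Without it, a
manual cleanup as shown above is the usual way to get the resource
going again once the fail-count has hit INFINITY. A rough sketch,
assuming the crm_resource in use supports the --meta switch and using
300s purely as an example interval:

  # crm_resource -r cl_stonith_lb02 --meta -p failure-timeout -v 300s
)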
