On Tue, Jun 24, 2008 at 04:02:06PM +0200, Lars Marowsky-Bree wrote:
> On 2008-06-24T15:48:12, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>
> > > But precisely we have two scenarios to configure to:
> > > a) monitor NG -> stop -> start on the same node
> > >    -> monitor NG (Nth time) -> stop -> failover to another node
> > > b) monitor NG -> monitor NG (Nth times) -> stop -> failover to
> > > another node
> > >
> > > The current pacemaker behaves as a), I think, but b) is also
> > > useful when you want to ignore a transient error.
> >
> > The b) part has already been discussed on the list and it's
> > supposed to be implemented in lrmd. I still don't have the API
> > defined, but thought about something like
> >
> > max-total-failures (how many times a monitor may fail)
> > max-consecutive-failures (how many times in a row a monitor may fail)
> >
> > These should probably be attributes defined on the monitor
> > operation level.
>
> The "ignore failure reports" clashes a bit with the "react to failures
> ASAP" requirement.
>
> It is my belief that this should be handled by the RA, not in the LRM
> nor the CRM. The monitor op implementation is the place to handle this.
>
> Beyond that, I strongly feel that "transient errors" are a bad
> foundation to build clusters on.
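For illustration, the per-operation thresholds proposed above could count
monitor results along these lines. This is only a sketch of the idea from
the discussion, not actual lrmd code; the class name, method, and escalation
behaviour are all assumptions:

```python
class MonitorFailurePolicy:
    """Hypothetical sketch of the proposed max-total-failures /
    max-consecutive-failures attributes on a monitor operation.
    Not real lrmd code; behaviour is an assumption from the thread."""

    def __init__(self, max_total_failures, max_consecutive_failures):
        self.max_total = max_total_failures
        self.max_consecutive = max_consecutive_failures
        self.total = 0        # failures over the resource's lifetime
        self.consecutive = 0  # failures in a row since the last success

    def record(self, monitor_ok):
        """Record one monitor result. Return True when the failure
        should be escalated (stop / failover), False when it can be
        ignored as transient."""
        if monitor_ok:
            self.consecutive = 0  # a success resets the streak
            return False
        self.total += 1
        self.consecutive += 1
        return (self.total >= self.max_total
                or self.consecutive >= self.max_consecutive)
```

With, say, max-consecutive-failures set to 3, two isolated monitor
failures separated by a success would be ignored, while a third failure
in a row would trigger recovery, which matches scenario b) above.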
Of course, all of that is right. However, there are some situations where
we could bend the rules. I'm not sure what Keisuke-san had in mind, but,
for example, one could be more forgiving when monitoring certain stonith
resources.

Thanks,

Dejan

_______________________________________________
Pacemaker mailing list
[email protected]
http://list.clusterlabs.org/mailman/listinfo/pacemaker
