On 09/24/2016 01:12 AM, Ken Gaillot wrote:
> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>
>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot <[email protected]> wrote:
>>
>> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>> >> Ken Gaillot <[email protected]> writes:
>> >>
>> >>> I'm not saying it's a bad idea, just that it's more complicated than
>> >>> it first sounds, so it's worth thinking through the implications.
>> >>
>> >> Thinking about it and looking at how complicated it gets, maybe what
>> >> you'd really want, to make it clearer for the user, is the ability to
>> >> explicitly configure the behavior, either globally or per-resource. So
>> >> instead of having to tweak a set of variables that interact in complex
>> >> ways, you'd configure something like rule expressions:
>> >>
>> >> <on_fail>
>> >>   <restart repeat="3" />
>> >>   <migrate timeout="60s" />
>> >>   <fence/>
>> >> </on_fail>
>> >>
>> >> So, try to restart the service 3 times; if that fails, migrate the
>> >> service; if it still fails, fence the node.
>> >>
>> >> (Obviously the details and XML syntax are just an example.)
>> >>
>> >> This would then replace on-fail, migration-threshold, etc.
>> >
>> > I must admit that in previous emails in this thread, I wasn't able to
>> > follow during the first pass, which is not the case with this
>> > procedural (sequence-ordered) approach. Though someone could argue it
>> > doesn't take the type of operation into account, which might again
>> > open the door to non-obvious interactions.
>>
>> "restart" is the only on-fail value that it makes sense to escalate.
>>
>> block/stop/fence/standby are final. Block means "don't touch the
>> resource again", so there can't be any further response to failures.
>> Stop/fence/standby move the resource off the local node, so failure
>> handling is reset (there are 0 failures on the new node to begin with).
>>
>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>> then migrate", but I can't think of a real-world situation where that
>> makes sense,
>>
>> really?
>>
>> it is not uncommon to hear "i know its failed, but i dont want the
>> cluster to do anything until its _really_ failed"
>
> Hmm, I guess that would be similar to how monitoring systems such as
> Nagios can be configured to send an alert only if N checks in a row
> fail. That's useful where transient outages (e.g. a webserver hitting
> its request limit) are acceptable for a short time.
>
> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
> is not "in a row" but "since the count was last cleared".
>
> "Ignore up to three monitor failures if they occur in a row [or, within
> 10 minutes?], then try soft recovery for the next two monitor failures,
> then ban this node for the next monitor failure." Not sure being able
> to say that is worth the complexity.

That is the reason why I suggested thinking about a solution that
exposes a set of failure statistics as environment variables and leaves
the final decision logic to be scripted in the RA or an additional
script.

>> and it would be a significant re-implementation of "ignore"
>> (which currently ignores the state of having failed, as opposed to a
>> particular instance of failure).
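For reference, the closest the existing options come to this kind of
escalation is a combination of on-fail, migration-threshold and
failure-timeout. A minimal CIB sketch; the resource, ids and values
here are only illustrative:

<primitive id="webserver" class="ocf" provider="heartbeat" type="apache">
  <meta_attributes id="webserver-meta_attributes">
    <!-- after 3 failures on a node, move the resource off that node -->
    <nvpair id="webserver-migration-threshold"
            name="migration-threshold" value="3"/>
    <!-- forget failures older than 10 minutes; the nearest current
         approximation of "N failures within a window" -->
    <nvpair id="webserver-failure-timeout"
            name="failure-timeout" value="10min"/>
  </meta_attributes>
  <operations>
    <!-- soft recovery (stop+start) on each monitor failure -->
    <op id="webserver-monitor-10s" name="monitor" interval="10s"
        timeout="20s" on-fail="restart"/>
    <!-- a failed stop escalates straight to fencing -->
    <op id="webserver-stop-0" name="stop" interval="0s"
        timeout="60s" on-fail="fence"/>
  </operations>
</primitive>

What this cannot express is the staged behaviour described above
("ignore the first few failures, then restart, then ban"): the very
first monitor failure already triggers a restart.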
>>
>> agreed
>>
>> What the interface needs to express is: "If this operation fails,
>> optionally try a soft recovery [always stop+start], but if <N> failures
>> occur on the same node, proceed to a [configurable] hard recovery".
>>
>> And of course the interface will need to be different depending on how
>> certain details are decided, e.g. whether any failures count toward <N>
>> or just failures of one particular operation type, and whether the hard
>> recovery type can vary depending on what operation failed.
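To make those open questions concrete, a purely hypothetical
per-operation variant of Kristoffer's sketch could look like the
following. None of these elements or attributes exist in Pacemaker
today; the names are invented for illustration only:

<op name="monitor" interval="10s">
  <on_fail>
    <ignore repeat="3" window="10min"/>  <!-- tolerate transient failures -->
    <restart repeat="2"/>                <!-- then soft recovery -->
    <ban/>                               <!-- then move off this node -->
  </on_fail>
</op>
<op name="stop" interval="0s">
  <on_fail>
    <fence/>  <!-- a failed stop escalates straight to fencing -->
  </on_fail>
</op>

That answers both questions one way: failures would be counted per
operation, and the hard-recovery step would be configurable per
operation.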
_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
