On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot <[email protected]> wrote:
> Hi everybody,
>
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
>
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
>
> I'd like to get your feedback on two alternative approaches we're
> considering.
>
> ###
>
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node,

Well, just as now, it would be _allowed_ to start on the same node, but
this is not guaranteed.

> before doing the on-fail handling.
>
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3
> failures.
>
> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.

I think if hard-fail-threshold is per-op, then only failures of that
operation should count.

> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
>
> ###
>
> The second proposed approach would add a new on-restart-fail resource
> property.
>
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
>
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban".

I do wish you well tracking "is this a restart" across demote -> stop ->
start -> promote in 4 different transitions :-)

> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and on-restart-fail=fence.
>
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
>
> ###
>
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.

More complex to implement or more complex to configure?

> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.
> --
> Ken Gaillot <[email protected]>
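To make the difference between the two proposals concrete, here is a rough
Python sketch of the decision each one implies. It is only a toy model of the
rules described above, not Pacemaker code: the function names, the
count_any_op flag, and the idea of passing in a plain list of failed
operations are all made up for illustration.

```python
def react_first(history, op, on_fail, threshold, count_any_op):
    """First proposal: hard-fail-threshold as a per-operation property.

    history is a list of failed operation names, oldest first, where the
    last entry is the failure being handled now (hypothetical input).
    count_any_op models the unsettled question of which failures count.
    """
    if count_any_op:
        seen = len(history)        # every failure counts, like migration-threshold today
    else:
        seen = history.count(op)   # only failures of this operation count
    return on_fail if seen >= threshold else "restart"


def react_second(failure_count, on_fail="restart",
                 on_restart_fail="ban", threshold=3):
    """Second proposal: on-restart-fail as a resource property.

    Anything other than on-fail=restart is applied on the first failure;
    on-restart-fail applies once the resource-wide threshold is reached.
    """
    if on_fail != "restart":
        return on_fail
    return on_restart_fail if failure_count >= threshold else "restart"


# The start / promote / monitor example from the quoted text, threshold 3:
history = ["start", "promote", "monitor"]
any_op = react_first(history, "monitor", "fence", 3, count_any_op=True)
per_op = react_first(history, "monitor", "fence", 3, count_any_op=False)
print(any_op, per_op)
```

With the any-operation rule the third failure triggers the on-fail action,
while with the per-operation rule the monitor has only failed once and the
resource is simply restarted. In the second function, on_fail="ban" and
on_fail="restart" with threshold=1 give the same answer on the first failure,
which matches the equivalence described in the quoted text.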
_______________________________________________ Users mailing list: [email protected] http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
