Hi everybody,

Currently, Pacemaker's on-fail property allows you to configure how the cluster reacts to operation failures. The default "restart" means try to restart on the same node, optionally moving to another node once migration-threshold is reached. Other possibilities are "ignore", "block", "stop", "fence", and "standby".

Occasionally, we get requests to have something like migration-threshold for values besides restart. For example, try restarting the resource on the same node 3 times, then fence.

I'd like to get your feedback on two alternative approaches we're considering.

###

Our first proposed approach would add a new hard-fail-threshold operation property. If specified, the cluster would first try restarting the resource on the same node, before doing the on-fail handling.

For example, you could configure a promote operation with hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.

One point that's not settled is whether failures of *any* operation would count toward the 3 failures (which is how migration-threshold works now), or only failures of the specified operation.

Currently, if a start fails (but is retried successfully), then a promote fails (but is retried successfully), then a monitor fails, the resource will move to another node if migration-threshold=3. We could keep that behavior with hard-fail-threshold, or only count monitor failures toward monitor's hard-fail-threshold. Each alternative has advantages and disadvantages.

###

The second proposed approach would add a new on-restart-fail resource property.

Same as now, on-fail set to anything but restart would be done immediately after the first failure.

A new value, "ban", would immediately move the resource to another node. (on-fail=ban would behave like on-fail=restart with migration-threshold=1.)

When on-fail=restart, and restarting on the same node doesn't work, the cluster would do the on-restart-fail handling. on-restart-fail would allow the same values as on-fail (minus "restart"), and would default to "ban".

So, if you want to fence immediately after any promote failure, you would still configure on-fail=fence; if you want to try restarting a few times first, you would configure on-fail=restart and on-restart-fail=fence.

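To make the second approach a bit more concrete, here is a rough Python sketch of the decision it implies. This is only a model for discussion, not Pacemaker code: the function name recovery_action and its parameters are made up for illustration, on-restart-fail and on-fail=ban exist only as proposals in this message, and I'm assuming the threshold is reached on the Nth failure on the node.

```python
# Hypothetical model of the second proposal; none of this is Pacemaker code.
# "ban" and on-restart-fail are proposed values/properties, not existing ones.
ON_FAIL_VALUES = {"restart", "ignore", "block", "stop", "fence", "standby", "ban"}
ON_RESTART_FAIL_VALUES = ON_FAIL_VALUES - {"restart"}


def recovery_action(fail_count, threshold, on_fail="restart", on_restart_fail="ban"):
    """Return the handling applied after the fail_count-th failure on a node.

    fail_count counts failures of any operation on that node, matching how
    migration-threshold is described as working today.
    """
    if on_fail not in ON_FAIL_VALUES:
        raise ValueError(f"unknown on-fail value: {on_fail}")
    if on_restart_fail not in ON_RESTART_FAIL_VALUES:
        raise ValueError(f"unknown on-restart-fail value: {on_restart_fail}")

    # Anything other than restart is applied immediately, as it is now.
    if on_fail != "restart":
        return on_fail

    # Keep restarting on the same node until the threshold is reached,
    # then fall back to the on-restart-fail handling.
    if fail_count < threshold:
        return "restart"
    return on_restart_fail


if __name__ == "__main__":
    # Restart a few times, then fence (on-fail=restart, on-restart-fail=fence).
    for n in (1, 2, 3):
        print(n, recovery_action(n, threshold=3, on_restart_fail="fence"))
```

With a threshold of 3, this model returns "restart" for the first two failures and "fence" for the third, while on_fail="fence" returns "fence" on the very first failure.
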
This approach keeps the current threshold behavior -- failures of any operation count toward the threshold. We'd rename migration-threshold to something like hard-fail-threshold, since it would apply to more than just migration, but unlike the first approach, it would stay a resource property.

###

Comparing the two approaches, the first is more flexible, but also more complex and potentially confusing.

With either approach, we would deprecate the start-failure-is-fatal cluster property. start-failure-is-fatal=true would be equivalent to hard-fail-threshold=1 with the first approach, and on-fail=ban with the second approach. This would be both simpler and more useful -- it allows the value to be set differently per resource.

-- 
Ken Gaillot
