Hello, when the monitor action for a resource times out, I think its failcount is incremented by 1, correct? If so, and the next monitor action succeeds, does the failcount automatically reset to zero, or does it stay at 1? In the latter case, is there any way to configure the cluster to reset it automatically when the following scheduled monitor completes OK? Or is it a job for the administrator to watch the failcount (e.g. in crm_mon output) and, after checking that all is well, clean up the resource so the failcount is reset?
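In other words, is the expected workflow something along these lines? (Just a sketch of what I mean; the resource and node names are the ones from my logs, and I am assuming crm_mon's --failcounts flag and crmsh's cleanup subcommand do what I think they do.)

  # show resource status plus per-node fail counts, one-shot
  crm_mon -1 --failcounts

  # after verifying the resource is actually healthy, clear its
  # failcount and failed-operation history on that node
  crm resource cleanup my_resource node2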
I ask because on a SLES 11 SP2 cluster, from which I only got the logs, I see this kind of message:

Jun 15 00:01:18 node2 pengine: [4330]: notice: common_apply_stickiness: my_resource can fail 1 more times on node2 before being forced off
...
Jun 15 03:38:42 node2 lrmd: [4328]: WARN: my_resource:monitor process (PID 27120) timed out (try 1). Killing with signal SIGTERM (15).
Jun 15 03:38:42 node2 lrmd: [4328]: WARN: operation monitor[29] on my_resource for client 4331: pid 27120 timed out
Jun 15 03:38:42 node2 crmd: [4331]: ERROR: process_lrm_event: LRM operation my_resource_monitor_30000 (29) Timed Out (timeout=60000ms)
Jun 15 03:38:42 node2 crmd: [4331]: info: process_graph_event: Detected action my_resource_monitor_30000 from a different transition: 40696 vs. 51755
Jun 15 03:38:42 node2 crmd: [4331]: WARN: update_failcount: Updating failcount for my_resource on node2 after failed monitor: rc=-2 (update=value++, time=1402796322)
...
Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-my_resource (3)
..
Jun 15 03:38:42 node2 attrd: [4329]: notice: attrd_perform_update: Sent update 52: fail-count-my_resource=3
..
Jun 15 03:38:42 node2 pengine: [4330]: WARN: common_apply_stickiness: Forcing my_resource away from node2 after 3 failures (max=3)

So it seems the resource already had a failcount of 2 at midnight (perhaps caused by problems that happened weeks ago?), and then at 03:38 its monitor timed out and it was relocated.

Pacemaker is at 1.1.6-1.27.26, and I see this list message that seems related:
http://oss.clusterlabs.org/pipermail/pacemaker/2012-August/015076.html

Is it perhaps only a matter of setting the failure-timeout meta attribute, as explained in the High Availability Guide:
https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html#sec.ha.config.hawk.rsc
in particular section 5.3.6, "Specifying Resource Failover Nodes":
"4. If you want to automatically expire the failcount for a resource, add the failure-timeout meta attribute to the resource as described in Procedure 5.4: Adding Primitive Resources, Step 7 and enter a Value for the failure-timeout."
?

Thanks in advance,
Gianluca
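P.S. In case failure-timeout is indeed the answer, I guess the change would look roughly like this in crmsh (only a sketch: the resource agent and the values are made up by me, migration-threshold=3 just matches the "max=3" in my logs):

  primitive my_resource ocf:heartbeat:Dummy \
      op monitor interval=30s timeout=60s \
      meta migration-threshold=3 failure-timeout=10min

or, on an already-defined resource:

  crm resource meta my_resource set failure-timeout 10min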
