I have a 2-node cluster with no-quorum-policy=ignore. I will call these nodes
node-0 and node-1 (gol-5-7-0 and gol-5-7-6, respectively, in the config
below). In addition, I have two cluster resources in a group: an IP address
and an OCF script.
Normally these resources are active on node-0. However, when I bounce
Pacemaker on node-1 (service pacemaker stop followed by service pacemaker
start), the OCF resource also gets bounced on node-0, which is unexpected and
causes problems for my application. In the log messages I see that a monitor
operation failed with "unknown error", leading to a "resource is active on 2
nodes" error, and the recovery procedure then bounces the OCF resource on
node-0. But when I run the monitor action of my OCF script manually, the
return value is always either OCF_SUCCESS (0) or OCF_NOT_RUNNING (7).
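For reference, this is roughly how I test the monitor action by hand (a
simplified sketch; the agent path assumes the standard OCF layout for the
"redhat" provider, and the OCF_RESKEY_* values come from the config below):

  # Invoke the agent the way Pacemaker does, with the OCF environment set
  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_name=gol-ha               # "name" parameter from the CIB
  export OCF_RESKEY_file=/etc/init.d/gol-ha   # "file" parameter from the CIB
  /usr/lib/ocf/resource.d/redhat/script.sh monitor
  echo "monitor rc: $?"   # always 0 (OCF_SUCCESS) or 7 (OCF_NOT_RUNNING)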
I am using the following software versions:
Pacemaker version: 1.1.10
Corosync version: 1.4.1-15
OS: CentOS 6.4
What am I doing wrong?
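In case the agent itself matters: the monitor path of script.sh boils down to
roughly the following (a simplified sketch, not the full agent; it just wraps
the init script named by the "file" parameter and maps its status to OCF
return codes):

  monitor() {
      # Delegate to the wrapped init script ("file" parameter from the CIB)
      if "$OCF_RESKEY_file" status >/dev/null 2>&1; then
          return 0   # OCF_SUCCESS
      else
          return 7   # OCF_NOT_RUNNING -- cleanly stopped, never a generic 1
      fi
  }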
Below are the CIB config and the corresponding log messages.
<cib epoch="10" num_updates="94" admin_epoch="0" validate-with="pacemaker-1.2"
cib-last-written="Tue Jan 7 18:11:58 2014" update-origin="gol-5-7-0"
update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1"
dc-uuid="gol-5-7-0">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
value="1.1.10-1.el6_4.4-368c726"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure"
name="cluster-infrastructure" value="cman"/>
<nvpair id="cib-bootstrap-options-stonith-enabled"
name="stonith-enabled" value="false"/>
<nvpair id="cib-bootstrap-options-no-quorum-policy"
name="no-quorum-policy" value="ignore"/>
<nvpair id="cib-bootstrap-options-migration-threshold"
name="migration-threshold" value="3"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="gol-5-7-6" uname="gol-5-7-6"/>
<node id="gol-5-7-0" uname="gol-5-7-0"/>
</nodes>
<resources>
<group id="Group">
<primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat"
type="IPaddr2">
<instance_attributes id="FAILOVER-INTER-instance_attributes">
<nvpair id="FAILOVER-INTER-instance_attributes-ip" name="ip"
value="10.20.7.190"/>
<nvpair id="FAILOVER-INTER-instance_attributes-nic" name="nic"
value="eth1"/>
<nvpair id="FAILOVER-INTER-instance_attributes-cidr_netmask"
name="cidr_netmask" value="14"/>
</instance_attributes>
<operations>
<op id="FAILOVER-INTER-monitor-interval-5s" interval="5s"
name="monitor"/>
</operations>
</primitive>
<primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
<instance_attributes id="GOL-HA-instance_attributes">
<nvpair id="GOL-HA-instance_attributes-name" name="name"
value="gol-ha"/>
<nvpair id="GOL-HA-instance_attributes-file" name="file"
value="/etc/init.d/gol-ha"/>
</instance_attributes>
<operations>
<op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
</operations>
</primitive>
</group>
</resources>
<constraints/>
<rsc_defaults>
<meta_attributes id="rsc_defaults-options">
<nvpair id="rsc_defaults-options-resource-stickiness"
name="resource-stickiness" value="100"/>
</meta_attributes>
</rsc_defaults>
</configuration>
Corresponding log messages:
Feb 04 11:27:29 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
Feb 04 11:27:29 [45168] gol-5-7-0 crmd: notice: crm_update_peer_state:
cman_event_callback: Node gol-5-7-6[2] - state is now member (was lost)
Feb 04 11:27:29 corosync [CPG ] chosen downlist: sender r(0) ip(172.16.0.2) ;
members(old:1 left:0)
Feb 04 11:27:29 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Feb 04 11:27:36 [45168] gol-5-7-0 crmd: notice: do_state_transition:
State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN
cause=C_FSA_INTERNAL origin=peer_update_callback ]
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-GOL-HA (5)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-GOL-HA (1391444085)
Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On
loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op:
Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: process_pe_message:
Calculated Transition 1825: /var/lib/pacemaker/pengine/pe-input-45.bz2
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 7: monitor FAILOVER-INTER_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 8: monitor GOL-HA_monitor_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: warning: status_from_rc:
Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 6: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: run_graph:
Transition 1825 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-45.bz2): Stopped
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On
loss of CCM Quorum: Ignore
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op:
Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op:
Processing failed op monitor for GOL-HA on gol-5-7-6: unknown error (1)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: native_create_actions:
Resource GOL-HA (ocf::script.sh) is active on 2 nodes attempting recovery
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: LogActions: Recover
GOL-HA (Started gol-5-7-0)
Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: process_pe_message:
Calculated Transition 1826: /var/lib/pacemaker/pengine/pe-error-3.bz2
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 10: stop GOL-HA_stop_0 on gol-5-7-0 (local)
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 3: stop GOL-HA_stop_0 on gol-5-7-6
Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 7: probe_complete probe_complete on gol-5-7-6 - no waiting
Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: process_lrm_event:
LRM operation GOL-HA_stop_0 (call=111, rc=0, cib-update=1953, confirmed=true) ok
Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 11: start GOL-HA_start_0 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event:
LRM operation GOL-HA_start_0 (call=115, rc=0, cib-update=1954, confirmed=true)
ok
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: te_rsc_command:
Initiating action 1: monitor GOL-HA_monitor_60000 on gol-5-7-0 (local)
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event:
LRM operation GOL-HA_monitor_60000 (call=118, rc=0, cib-update=1955,
confirmed=false) ok
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: run_graph:
Transition 1826 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]