Hi
rpm -qa | grep pace
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
centos 6.5
I have a 2 node cluster
pcs config
Cluster Name: ybrp
Corosync Nodes:
Pacemaker Nodes:
wwwrp1 wwwrp2
Resources:
Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0
clusterip_hash=sourceip-sourceport
Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
Operations: start on-fail=restart interval=0s timeout=60s
(ybrpip-start-interval-0s)
monitor on-fail=restart interval=5s timeout=20s
(ybrpip-monitor-interval-5s)
stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
Clone: ybrpstat-clone
Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
Resource: ybrpstat (class=ocf provider=yb type=proxy)
Operations: monitor on-fail=restart interval=5s timeout=20s
(ybrpstat-monitor-interval-5s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
start ybrpstat-clone then start ybrpip (Mandatory)
(id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
ybrpip with ybrpstat-clone (INFINITY)
(id:colocation-ybrpip-ybrpstat-clone-INFINITY)
Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.10-14.el6_5.3-368c726
last-lrm-refresh: 1414629002
no-quorum-policy: ignore
stonith-enabled: false
Basically 1 node died (wwwrp1) hardware failure
I can see in the logs the cluster wants to bring the IP address over and it
seems to do it but
## this seems to be the ip address moving
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address
10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used
not_used
## this seems to be where it checks other stuff
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation
ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action
8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation
ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828
(Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
# this is where monitor times out, but it doesn't look like 20000ms .. the
initial try was at Oct 30 01:19:43
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback:
ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished:
ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation
ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating
failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++,
time=1414592384)
I'm guessing because it timed out it went into failed mode. I need to know why
it timed out. The script has never timed out before or in testing...
Am I reading this right. The reason the resource didn't fail over (ip address)
was because there was no ybrpstat running on wwwrp2 and the reason for that was
the monitor action failed/timed out
Thanks
Alex
=== logs ==
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start
ybrpip#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover
ybrpstat:0#011(Started wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated
Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start
ybrpip#011(wwwrp2)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop
ybrpstat:0#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated
Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action
7: start ybrpip_start_0 on wwwrp2 (local)
Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action
1: stop ybrpstat_stop_0 on wwwrp1
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address
10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO:
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used
not_used
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation
ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating action
8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM operation
ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828
(Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback:
ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished:
ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation
ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating
failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++,
time=1414592384)
Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update
12543: fail-count-ybrpstat=1
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-ybrpstat (1414592384)
Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent update
12545: last-failure-ybrpstat=1414592384
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart
ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover
ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated
Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness:
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart
ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover
ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message: Calculated
Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2
_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org