[Pacemaker] Help understanding what went wrong

Alex Samad - Yieldbroker Wed, 29 Oct 2014 17:50:12 -0700

Hi


rpm -qa | grep pace
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64

centos 6.5


I have a 2 node cluster
pcs config 
Cluster Name: ybrp
Corosync Nodes:
 
Pacemaker Nodes:
 wwwrp1 wwwrp2 

Resources: 
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s 
  Operations: start on-fail=restart interval=0s timeout=60s 
(ybrpip-start-interval-0s)
              monitor on-fail=restart interval=5s timeout=20s 
(ybrpip-monitor-interval-5s)
              stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1 
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s 
(ybrpstat-monitor-interval-5s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) 
(id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) 
(id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1414629002
 no-quorum-policy: ignore
 stonith-enabled: false



Basically 1 node died (wwwrp1) hardware failure

I can see in the logs the cluster wants to bring the IP address over and it 
seems to do it but



## this seems to be the ip address moving
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 
10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used 
not_used


## this seems to be where it checks other stuff
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation 
ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 
8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation 
ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 
(Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]

# this is where monitor times out, but it doesn't look like 20000ms .. the 
initial try was at Oct 30 01:19:43
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: 
ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: 
ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation 
ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating 
failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, 
time=1414592384)

I'm guessing because it timed out it went into failed mode. I need to know why 
it timed out. The script has never timed out before or in testing...

Am I reading this right.  The reason the resource didn't fail over (ip address) 
was because there was no ybrpstat running on wwwrp2 and the reason for that was 
the monitor action failed/timed out

Thanks
Alex


=== logs ==

Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start   
ybrpip#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Recover 
ybrpstat:0#011(Started wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated 
Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start   
ybrpip#011(wwwrp2)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Stop    
ybrpstat:0#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated 
Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 
7: start ybrpip_start_0 on wwwrp2 (local)
Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 
1: stop ybrpstat_stop_0 on wwwrp1
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 
10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: 
/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
/var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used 
not_used
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation 
ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 
8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation 
ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 
(Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: 
ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: 
ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation 
ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating 
failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, 
time=1414592384)
Oct 30 01:19:44 wwwrp2 crmd[2462]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 
12543: fail-count-ybrpstat=1
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-ybrpstat (1414592384)
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 
12545: last-failure-ybrpstat=1414592384
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart 
ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover 
ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated 
Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing 
failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart 
ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover 
ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated 
Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

[Pacemaker] Help understanding what went wrong

Reply via email to