On Thu, 20 Nov 2014 14:53:10 +0000, "Brusq, Jerome" <[email protected]> wrote:
> Dear all,
>
> We are running corosync-1.4.1-4.el6.x86_64 on RHEL 6.2. We have a
> two-node cluster that runs a virtual IP and around 20 custom LSB
> scripts as services.
> We had a network issue two weeks ago and saw some strange behavior.
>
> In the logs below we can see that at 08:36:29 on Nov 06 the connection
> between the nodes was lost, and then a strange log entry appears:
> "pengine: [2621]: ERROR: unpack_rsc_op: Hard error -
> 5-controller_last_failure_0 failed with rc=3: Preventing 5-controller
> from re-starting on srv01".
>
> Only one of our LSB scripts, "5-controller", got a "Hard error", and
> this service ended up with the status "unmanaged". Once the network
> came back, the two nodes communicated again and all services were
> restarted, except "5-controller".
>
> Do you have an idea what happened?

The error is returned by the resource agent; pacemaker is just the
messenger. You need to check the LSB script to find out why it returns
that code. A sketch of the exit-code convention pacemaker expects
follows the quoted logs below.

> Thanks a lot,
>
> Jerome
>
>
> Nov 06 08:31:35 srv01 cib: [2618]: info: cib_stats: Processed 71 operations (5633.00us average, 0% utilization) in the last 10min
> Nov 06 08:36:28 corosync [TOTEM ] A processor failed, forming new configuration.
> Nov 06 08:36:29 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 156: memb=1, new=0, lost=1
> Nov 06 08:36:29 corosync [pcmk ] info: pcmk_peer_update: memb: srv01 1661074698
> Nov 06 08:36:29 corosync [pcmk ] info: pcmk_peer_update: lost: srv02 1677851914
> Nov 06 08:36:29 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 156: memb=1, new=0, lost=0
> Nov 06 08:36:29 corosync [pcmk ] info: pcmk_peer_update: MEMB: srv01 1661074698
> Nov 06 08:36:29 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node srv02 was not seen in the previous transition
> Nov 06 08:36:29 corosync [pcmk ] info: update_member: Node 1677851914/srv02 is now: lost
> Nov 06 08:36:29 corosync [pcmk ] info: send_member_notification: Sending membership update 156 to 2 children
> Nov 06 08:36:29 srv01 cib: [2618]: notice: ais_dispatch_message: Membership 156: quorum lost
> Nov 06 08:36:29 srv01 cib: [2618]: info: crm_update_peer: Node srv02: id=1677851914 state=lost (new) addr=r(0) ip(10.10.1.100) votes=1 born=76 seen=84 proc=00000000000000000000000000111312
> Nov 06 08:36:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Nov 06 08:36:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.10.1.99) ; members(old:2 left:1)
> Nov 06 08:36:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Nov 06 08:36:29 srv01 crmd: [2622]: notice: ais_dispatch_message: Membership 156: quorum lost
> Nov 06 08:36:29 srv01 crmd: [2622]: info: ais_status_callback: status: srv02 is now lost (was member)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: crm_update_peer: Node srv02: id=1677851914 state=lost (new) addr=r(0) ip(10.10.1.100) votes=1 born=76 seen=84 proc=00000000000000000000000000111312
> Nov 06 08:36:29 srv01 crmd: [2622]: WARN: check_dead_member: Our DC node (srv02) left the cluster
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
> Nov 06 08:36:29 srv01 crmd: [2622]: info: update_dc: Unset DC srv02
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_te_control: Registering TE UUID: ceab74dd-6a27-4eda-ba66-b63d77beaee5
> Nov 06 08:36:29 srv01 crmd: [2622]: info: set_graph_functions: Setting custom graph functions
> Nov 06 08:36:29 srv01 crmd: [2622]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_dc_takeover: Taking over DC status for this partition
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_readwrite: We are now in R/W mode
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/709, version=0.2504.157): ok (rc=0)
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/710, version=0.2504.158): ok (rc=0)
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/712, version=0.2504.159): ok (rc=0)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: join_make_offer: Making join offers based on membership 156
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/714, version=0.2504.160): ok (rc=0)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: ais_dispatch_message: Membership 156: quorum still lost
> Nov 06 08:36:29 srv01 crmd: [2622]: info: crmd_ais_dispatch: Setting expected votes to 2
> Nov 06 08:36:29 srv01 crmd: [2622]: info: update_dc: Set DC to srv01 (3.0.5)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
> Nov 06 08:36:29 srv01 crmd: [2622]: info: config_query_callback: Checking for expired actions every 180000ms
> Nov 06 08:36:29 srv01 crmd: [2622]: info: config_query_callback: Sending expected-votes=2 to corosync
> Nov 06 08:36:29 srv01 crmd: [2622]: info: ais_dispatch_message: Membership 156: quorum still lost
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/717, version=0.2504.161): ok (rc=0)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: crmd_ais_dispatch: Setting expected votes to 2
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/720, version=0.2504.162): ok (rc=0)
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
> Nov 06 08:36:29 srv01 crmd: [2622]: info: do_dc_join_finalize: join-1: Syncing the CIB from srv01 to the rest of the cluster
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/721, version=0.2504.162): ok (rc=0)
> Nov 06 08:36:29 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/722, version=0.2504.163): ok (rc=0)
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_dc_join_ack: join-1: Updating node state to member for srv01
> Nov 06 08:36:30 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='srv01']/lrm (origin=local/crmd/723, version=0.2504.164): ok (rc=0)
> Nov 06 08:36:30 srv01 crmd: [2622]: info: erase_xpath_callback: Deletion of "//node_state[@uname='srv01']/lrm": ok (rc=0)
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
> Nov 06 08:36:30 srv01 crmd: [2622]: info: crm_update_quorum: Updating quorum status to false (call=727)
> Nov 06 08:36:30 srv01 crmd: [2622]: info: abort_transition_graph: do_te_invoke:162 - Triggered transition abort (complete=1) : Peer Cancelled
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_pe_invoke: Query 728: Requesting the current CIB: S_POLICY_ENGINE
> Nov 06 08:36:30 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/725, version=0.2504.166): ok (rc=0)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-record-events (1413990260)
> Nov 06 08:36:30 srv01 crmd: [2622]: WARN: match_down_event: No match for shutdown action on srv02
> Nov 06 08:36:30 srv01 crmd: [2622]: info: te_update_diff: Stonith/shutdown of srv02 not matched
> Nov 06 08:36:30 srv01 crmd: [2622]: info: abort_transition_graph: te_update_diff:234 - Triggered transition abort (complete=1, tag=node_state, id=srv02, magic=NA, cib=0.2504.167) : Node failure
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_pe_invoke: Query 729: Requesting the current CIB: S_POLICY_ENGINE
> Nov 06 08:36:30 srv01 cib: [2618]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/727, version=0.2504.168): ok (rc=0)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-replay-1 (1415025615)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-4-server (1413990250)
> Nov 06 08:36:30 srv01 crmd: [2622]: info: do_pe_invoke_callback: Invoking the PE: query=729, ref=pe_calc-dc-1415259390-10, seq=156, quorate=0
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-replay-2 (1415025597)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Proxy (1)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-record-1 (1414568098)
> Nov 06 08:36:30 srv01 attrd: [2620]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-replay-1 (1)
> Nov 06 08:36:30 srv01 pengine: [2621]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Nov 06 08:36:30 srv01 pengine: [2621]: ERROR: unpack_rsc_op: Hard error - 5-controller_last_failure_0 failed with rc=3: Preventing 5-controller from re-starting on srv01
> Nov 06 08:36:30 srv01 pengine: [2621]: WARN: unpack_rsc_op: Processing failed op 5-controller_last_failure_0 on srv01: unimplemented feature (3)
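For what it's worth, the "unimplemented feature (3)" in the last pengine
line is how pacemaker interprets exit code 3 from an LSB script: only for
the "status" action does 3 mean "not running"; from start, stop or an
unknown action it is taken as "unimplemented feature", which is a hard
error, and the resource is blocked from restarting on that node. Here is
a minimal sketch of the exit-code convention an LSB script should follow.
The daemon path and pid file are invented for illustration, since we have
not seen your 5-controller script:

  #!/bin/sh
  # Minimal LSB init-script skeleton showing the exit codes pacemaker
  # expects.  DAEMON and PIDFILE are hypothetical; they are not taken
  # from the real 5-controller script.

  DAEMON=/usr/local/bin/5-controller   # hypothetical path
  PIDFILE=/var/run/5-controller.pid    # hypothetical path

  running() {
      [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
  }

  case "$1" in
    start)
      running && exit 0            # already running counts as success
      "$DAEMON" &
      echo $! > "$PIDFILE"
      exit 0
      ;;
    stop)
      running && kill "$(cat "$PIDFILE")"
      rm -f "$PIDFILE"
      exit 0                       # stopping a stopped service is success
      ;;
    status)
      # Only for "status" does exit code 3 mean "not running";
      # pacemaker maps it to "stopped" rather than to an error.
      if running; then exit 0; else exit 3; fi
      ;;
    restart)
      "$0" stop
      exec "$0" start
      ;;
    *)
      # Exit code 3 from any action other than "status" means
      # "unimplemented feature", the hard error in the logs above.
      echo "Usage: $0 {start|stop|restart|status}" >&2
      exit 3
      ;;
  esac

Once the script is fixed, the recorded failure has to be cleared before
the policy engine will consider the resource again, e.g. with
"crm_resource --cleanup --resource 5-controller" (or "crm resource
cleanup 5-controller" from the crm shell).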
