I have a two-node cluster that manages 4 resources in a resource group. Node 1 was active and was rebooted; the resources started on the second node. At the exact moment the first node finished rebooting, crmd failed on the second node (logs below). Both nodes are running the pacemaker-1.1.10-0.15.25 RPM.
Any ideas on how to determine what happened here? Is this a problem with crmd?

Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: crmd_node_update_complete: Node update 51 failed: Timer expired (-62)
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: do_log: FSA: Input I_ERROR from crmd_node_update_complete() received in state S_IDLE
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: do_state_transition: State transition S_IDLE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_node_update_complete ]
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: warning: do_recover: Fast-tracking shutdown in response to errors
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (5 ops remaining)
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Recurring action cdssRA:17 (cdssRA_monitor_15000) incomplete at shutdown
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Recurring action mcast_IP:22 (mcast_IP_monitor_5000) incomplete at shutdown
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Recurring action mgmt_IP:27 (mgmt_IP_monitor_5000) incomplete at shutdown
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Recurring action cdssDB:12 (cdssDB_monitor_30000) incomplete at shutdown
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: lrm_state_verify_stopped: Recurring action mcast-route:32 (mcast-route_monitor_10000) incomplete at shutdown
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: lrm_state_verify_stopped: 6 resources were active at shutdown.
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: do_lrm_control: Disconnected from the LRM
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: terminate_cs_connection: Disconnecting from Corosync
Oct 15 04:46:46 vho-1-mc2 corosync[12120]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x65e6d0, async-conn=0x65e6d0) left
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: crmd_fast_exit: Could not recover from internal error
Oct 15 04:46:47 vho-1-mc2 corosync[12120]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=12132, rc=201)
Oct 15 04:46:47 vho-1-mc2 corosync[12120]: [pcmk ] info: update_member: Node vho-1-mc2 now has process list: 00000000000000000000000000151112 (1380626)
Oct 15 04:46:47 vho-1-mc2 corosync[12120]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd

_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
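As a starting point for triage, it can help to pull just the FSA error context out of syslog before attaching a full report (Pacemaker's `crm_report` tool can bundle logs and the CIB from all nodes for a given time window). A minimal sketch, using a sample excerpt from the logs above so the commands are runnable as-is; the real log path (`/var/log/messages` or similar) and the `crm_report` time window are assumptions you would adjust:

```shell
# Write a small sample of the posted log lines (stand-in for /var/log/messages).
cat > /tmp/crmd-sample.log <<'EOF'
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: crmd_node_update_complete: Node update 51 failed: Timer expired (-62)
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: error: do_log: FSA: Input I_ERROR from crmd_node_update_complete() received in state S_IDLE
Oct 15 04:46:46 vho-1-mc2 crmd[12132]: notice: do_state_transition: State transition S_IDLE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_node_update_complete ]
EOF

# Count the lines that mark the failure: the update timeout and the FSA error input.
grep -cE 'I_ERROR|crmd_node_update_complete' /tmp/crmd-sample.log

# For a full picture across both nodes, something like (time window is illustrative):
#   crm_report --from "2014-10-15 04:40:00" --to "2014-10-15 04:50:00" /tmp/crmd-failure
```

Running the `grep` against the real syslog on both nodes around 04:46 should show whether the CIB update timeout (`Timer expired (-62)`) was preceded by membership churn from the rebooting node.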
