Ok, after reading the log files again I found

Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
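As background on the 'not configured' (6) result in the excerpt above: in the OCF return-code convention described in the Pacemaker documentation, 6 is OCF_ERR_CONFIGURED and is handled as a fatal error (stop the resource and do not start it on any node), whereas codes 2 to 5 are hard errors that only keep the resource off the current node. A minimal sketch of that classification follows. The helper name ocf_rc_class is hypothetical, and the mapping reflects the documentation as understood here, not an inspection of the Pacemaker source.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs or Pacemaker): map an OCF resource
# agent return code to the failure class the Pacemaker documentation
# assigns to it.
ocf_rc_class() {
    case "$1" in
        0)       echo "success" ;;      # OCF_SUCCESS
        1)       echo "soft" ;;         # OCF_ERR_GENERIC: may be retried on the same node
        2|3|4|5) echo "hard" ;;         # ARGS/UNIMPLEMENTED/PERM/INSTALLED: move to another node
        6)       echo "fatal" ;;        # OCF_ERR_CONFIGURED: not started on any node
        7)       echo "not-running" ;;  # OCF_NOT_RUNNING: cleanly stopped
        *)       echo "other" ;;        # codes this sketch does not cover
    esac
}

# The monitor failure in the log reported rc=6.
ocf_rc_class 6
```

Under this reading, a failure classified as hard would have allowed mda-ip to move to MDA1PFP-PCS02, while the fatal class matches the "Preventing mda-ip from re-starting anywhere" message.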
I think that explains why the resource is not started on the other node, but I am not sure this is a good decision. It seems a little harsh to prevent the resource from starting anywhere, especially considering that the other node would be able to start it.

Cheers,
Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
[email protected]
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be found at de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail.

________________________________________
From: Auer, Jens
Sent: Monday, 19 September 2016 12:08
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: AW: [ClusterLabs] Virtual ip resource restarted on node with down network device

Hi,

> Would "rmmod <interface-driver-module>" be a better hammer of choice?

I am just testing what happens in case of hardware/network issues. Any hammer is good enough. Worst case would be that I unplug the machine, maybe with ILO.

I have created a simple test setup of a two-node cluster with a virtual IP and a ping resource. The virtual IP should move to the other node when I unload the drivers on the active node. The configuration is

MDA1PFP-S02 10:02:53 1203 0 ~ # pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
MDA1PFP-PCS01: Succeeded
MDA1PFP-PCS02: Succeeded
Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
Restaring pcsd on the nodes in order to reload the certificates...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
MDA1PFP-S02 10:03:02 1204 0 ~ # pcs cluster start --all
MDA1PFP-PCS01: Starting Cluster...
MDA1PFP-PCS02: Starting Cluster...
MDA1PFP-S02 10:03:03 1205 0 ~ # sleep 5
rm -f mda; pcs cluster cib mda
pcs -f mda property set no-quorum-policy=ignore
pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
MDA1PFP-S02 10:03:08 1206 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 --name ServerRole --update PRIME
MDA1PFP-S02 10:03:08 1207 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 --name ServerRole --update BACKUP
MDA1PFP-S02 10:03:08 1208 0 ~ # pcs property set stonith-enabled=false
MDA1PFP-S02 10:03:08 1209 0 ~ # rm -f mda; pcs cluster cib mda
MDA1PFP-S02 10:03:08 1210 0 ~ # pcs -f mda property set no-quorum-policy=ignore
MDA1PFP-S02 10:03:08 1211 0 ~ #
MDA1PFP-S02 10:03:08 1211 0 ~ # pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
MDA1PFP-S02 10:03:08 1212 0 ~ # pcs -f mda resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1 params timeout=1 attempts=3 op monitor interval=1 --clone
MDA1PFP-S02 10:03:12 1213 0 ~ # pcs -f mda constraint location mda-ip rule score=-INFINITY pingd lt 1 or not_defined pingd
MDA1PFP-S02 10:03:12 1214 0 ~ # pcs cluster cib-push mda
CIB updated

When I now unload the drivers on the active node the VIP resource is stopped but never started on the other node although it can ping.

MDA1PFP-S01 10:02:49 2162 0 ~ # modprobe -r bonding; modprobe -r ixgbe
MDA1PFP-S01 10:03:45 2163 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Mon Sep 19 10:04:38 2016 Last change: Mon Sep 19 10:03:25 2016 by hacluster via crmd on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 3 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 mda-ip (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Failed Actions:
* mda-ip_monitor_1000 on MDA1PFP-PCS01 'not configured' (6): call=14, status=complete, exitreason='Unknown interface [bond0] No such device.',
    last-rc-change='Mon Sep 19 10:03:45 2016', queued=0ms, exec=0ms

PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

The log from the other node, to which the resource should be migrated, is:

Sep 19 10:03:12 MDA1PFP-S02 pcsd: Starting pcsd:
Sep 19 10:03:12 MDA1PFP-S02 systemd: Starting PCS GUI and remote configuration interface...
Sep 19 10:03:12 MDA1PFP-S02 systemd: Started PCS GUI and remote configuration interface.
Sep 19 10:03:15 MDA1PFP-S02 attrd[12444]: notice: Updating all attributes after cib_refresh_notify event
Sep 19 10:03:15 MDA1PFP-S02 crmd[12446]: notice: Notifications disabled
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Sep 19 10:03:25 MDA1PFP-S02 attrd[12444]: notice: Processing sync-response from MDA1PFP-PCS01
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]: notice: Operation ping_monitor_0: not running (node=MDA1PFP-PCS02, call=10, rc=7, cib-update=13, confirmed=true)
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]: notice: Operation mda-ip_monitor_0: not running (node=MDA1PFP-PCS02, call=5, rc=7, cib-update=14, confirmed=true)
Sep 19 10:03:28 MDA1PFP-S02 crmd[12446]: notice: Operation ping_start_0: ok (node=MDA1PFP-PCS02, call=11, rc=0, cib-update=15, confirmed=true)
Sep 19 10:03:48 MDA1PFP-S02 corosync[12425]: [TOTEM ] Marking ringid 1 interface 192.168.120.11 FAULTY

On the initially active node hosting the VIP, the log is

Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for fe80::5eb9:1ff:fe9c:e7fc on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.20 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 192.168.120.10 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for bond0.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (bond0): bond slave eno49 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (eno49): released from master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (bond0): bond slave eno50 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (eno50): released from master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (eno50): link disconnected
Sep 19 10:03:45 MDA1PFP-S01 gnome-session: Gjs-Message: JS LOG: Removing a network device that was not added
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for eno50.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn> (eno50): failed to disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service for eno49.
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.1: complete
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info> (eno49): device state change: disconnected -> unmanaged (reason 'removed') [30 10 36]
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn> (eno49): failed to disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.0: complete
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_monitor_1000:8714:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: notice: Stop mda-ip (MDA1PFP-PCS01)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: notice: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-501.bz2
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-502.bz2
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 4 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-502.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #21 bond0, 192.168.120.20#123, interface stats: received=0, sent=0, dropped=0, active_time=12 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #19 bond0, fe80::5eb9:1ff:fe9c:e7fc#123, interface stats: received=0, sent=0, dropped=0, active_time=218 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #18 bond0, 192.168.120.10#123, interface stats: received=0, sent=0, dropped=0, active_time=218 secs
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a0
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a3 a5
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5 a7
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5
Sep 19 10:03:48 MDA1PFP-S01 corosync[7776]: [TOTEM ] Marking ringid 1 interface 192.168.120.10 FAULTY
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: notice: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-503.bz2
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]: notice: Transition 5 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-503.bz2): Complete
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Best wishes,
Jens
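
The decisive entries in the logs above are the pengine "Preventing ... from re-starting anywhere" messages, which are easy to miss among the NetworkManager, avahi and corosync noise. A small sketch for pulling them out of a syslog-style log follows. The function name banned_resources is hypothetical, and the pattern assumes the exact message wording shown in this thread (Pacemaker 1.1.13); other versions may phrase it differently.

```shell
#!/bin/sh
# Hypothetical helper: read Pacemaker log lines on stdin and print one
# "resource operation reason" line for every distinct
# "Preventing <resource> from re-starting anywhere" message.
banned_resources() {
    sed -n "s/.*Preventing \([^ ]*\) from re-starting anywhere: operation \([^ ]*\) failed '\([^']*\)'.*/\1 \2 \3/p" |
        sort -u
}

# Example with two lines taken from the log in this thread.
printf '%s\n' \
    "Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore" \
    "Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)" |
    banned_resources
```

On a live node the function would be fed from the system log, for example banned_resources < /var/log/messages, where the log path is an assumption about the syslog configuration.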
