Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

Auer, Jens Mon, 19 Sep 2016 03:26:56 -0700

OK, after reading the log files again, I found:

Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop 
mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: 
MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface 
[bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
[bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ 
ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok 
(node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
re-starting anywhere: operation monitor failed 'not configured' (6)


I think that explains why the resource is not started on the other node, but I 
am not sure this is a good decision. It seems a little harsh to prevent the 
resource from starting anywhere, especially since the other node would be able 
to start it.
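
For context on the "not configured (6)" message: as far as I understand the
Pacemaker documentation, OCF exit code 6 (OCF_ERR_CONFIGURED) is treated as a
fatal error, meaning the resource is stopped and not started on any node,
whereas codes 2 to 5 are hard errors that only move the resource away from the
failing node. A minimal sketch of that mapping follows. The function name
ocf_recovery_type is hypothetical, and the table is an illustration of the
documented behaviour, not Pacemaker's own code.

```shell
#!/bin/sh
# Hypothetical helper: map an OCF exit code to the recovery type that the
# Pacemaker documentation lists for it. Illustration only.
ocf_recovery_type() {
    case "$1" in
        1)       echo "soft" ;;   # OCF_ERR_GENERIC: restart, possibly on the same node
        2|3|4|5) echo "hard" ;;   # args/unimplemented/perm/installed: move to another node
        6)       echo "fatal" ;;  # OCF_ERR_CONFIGURED: stop and do not start anywhere
        *)       echo "other" ;;  # success and status codes are not covered here
    esac
}

# The failed monitor in the log above returned 6.
ocf_recovery_type 6
```

Under this reading, IPaddr2 reporting a missing bond0 as a configuration error
is what turns a node-local problem into a cluster-wide ban.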

Cheers,
  Jens
--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
[email protected]
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be 
found at de.cgi.com/pflichtangaben.


________________________________________
From: Auer, Jens
Sent: Monday, 19 September 2016 12:08
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: RE: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

Hi,

> Would "rmmod <interface-driver-module>" be a better hammer of choice?

I am just testing what happens in case of hardware/network issues. Any hammer 
is good enough. Worst case would be that I unplug the machine, maybe with ILO.

I have created a simple test setup of a two-node cluster with a virtual IP 
and a ping resource. The virtual IP should move to the other node when I 
unload the drivers on the active node. The configuration is:
MDA1PFP-S02 10:02:53 1203 0 ~ # pcs cluster setup --name MDA1PFP 
MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
MDA1PFP-PCS01: Succeeded
MDA1PFP-PCS02: Succeeded
Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success

Restaring pcsd on the nodes in order to reload the certificates...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
MDA1PFP-S02 10:03:02 1204 0 ~ # pcs cluster start --all
MDA1PFP-PCS01: Starting Cluster...
MDA1PFP-PCS02: Starting Cluster...
MDA1PFP-S02 10:03:03 1205 0 ~ # sleep 5
rm -f mda; pcs cluster cib mda
pcs -f mda property set no-quorum-policy=ignore

pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 
cidr_netmask=32 nic=bond0 op monitor interval=1s
MDA1PFP-S02 10:03:08 1206 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 
--name ServerRole --update PRIME
MDA1PFP-S02 10:03:08 1207 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 
--name ServerRole --update BACKUP
MDA1PFP-S02 10:03:08 1208 0 ~ # pcs property set stonith-enabled=false
MDA1PFP-S02 10:03:08 1209 0 ~ # rm -f mda; pcs cluster cib mda
MDA1PFP-S02 10:03:08 1210 0 ~ # pcs -f mda property set no-quorum-policy=ignore
MDA1PFP-S02 10:03:08 1211 0 ~ #
MDA1PFP-S02 10:03:08 1211 0 ~ # pcs -f mda resource create mda-ip 
ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor 
interval=1s
MDA1PFP-S02 10:03:08 1212 0 ~ # pcs -f mda resource create ping 
ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1  params 
timeout=1 attempts=3  op monitor interval=1 --clone
MDA1PFP-S02 10:03:12 1213 0 ~ # pcs -f mda constraint location mda-ip rule 
score=-INFINITY pingd lt 1 or not_defined pingd
MDA1PFP-S02 10:03:12 1214 0 ~ # pcs cluster cib-push mda
CIB updated
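
To make the location rule above concrete: the ping agent publishes a node
attribute (pingd by default) equal to multiplier times the number of reachable
hosts in host_list, so with one host and multiplier=1000 the value should be
either 1000 or 0. The sketch below mimics how the rule "pingd lt 1 or
not_defined pingd" would evaluate for a single node. rule_result is a
hypothetical helper written for illustration, not a pcs or Pacemaker command.

```shell
#!/bin/sh
# Hypothetical helper mirroring the rule "pingd lt 1 or not_defined pingd".
# $1 is the node's pingd attribute value; an empty string means "not defined".
rule_result() {
    pingd="$1"
    if [ -z "$pingd" ] || [ "$pingd" -lt 1 ]; then
        echo "banned"    # rule matches: score=-INFINITY for mda-ip on this node
    else
        echo "allowed"   # rule does not match: no score applied
    fi
}

rule_result 1000   # ping target reachable
rule_result 0      # ping target unreachable
rule_result ""     # attribute not set yet
```

Note that a rule like this only excludes individual nodes. It cannot by itself
explain a resource that stays stopped while the other node still has a
positive pingd value.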

When I now unload the drivers on the active node, the VIP resource is stopped 
there but never started on the other node, although that node can still ping 
the target.

MDA1PFP-S01 10:02:49 2162 0 ~ # modprobe -r bonding; modprobe -r ixgbe
MDA1PFP-S01 10:03:45 2163 0 ~ # pcs status
Cluster name: MDA1PFP
Last updated: Mon Sep 19 10:04:38 2016          Last change: Mon Sep 19 
10:03:25 2016 by hacluster via crmd on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with 
quorum
2 nodes and 3 resources configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Full list of resources:

 mda-ip (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: ping-clone [ping]
     Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

Failed Actions:
* mda-ip_monitor_1000 on MDA1PFP-PCS01 'not configured' (6): call=14, 
status=complete, exitreason='Unknown interface [bond0] No such device.',
    last-rc-change='Mon Sep 19 10:03:45 2016', queued=0ms, exec=0ms


PCSD Status:
  MDA1PFP-PCS01: Online
  MDA1PFP-PCS02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

The log from the other node, to which the resource should be migrated, is:
Sep 19 10:03:12 MDA1PFP-S02 pcsd: Starting pcsd:
Sep 19 10:03:12 MDA1PFP-S02 systemd: Starting PCS GUI and remote configuration 
interface...
Sep 19 10:03:12 MDA1PFP-S02 systemd: Started PCS GUI and remote configuration 
interface.
Sep 19 10:03:15 MDA1PFP-S02 attrd[12444]:  notice: Updating all attributes 
after cib_refresh_notify event
Sep 19 10:03:15 MDA1PFP-S02 crmd[12446]:  notice: Notifications disabled
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: warning: FSA: Input I_DC_TIMEOUT from 
crm_timer_popped() received in state S_PENDING
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]:  notice: State transition S_ELECTION 
-> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL 
origin=do_election_count_vote ]
Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]:  notice: State transition S_PENDING -> 
S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond 
]
Sep 19 10:03:25 MDA1PFP-S02 attrd[12444]:  notice: Processing sync-response 
from MDA1PFP-PCS01
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]:  notice: Operation ping_monitor_0: not 
running (node=MDA1PFP-PCS02, call=10, rc=7, cib-update=13, confirmed=true)
Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]:  notice: Operation mda-ip_monitor_0: 
not running (node=MDA1PFP-PCS02, call=5, rc=7, cib-update=14, confirmed=true)
Sep 19 10:03:28 MDA1PFP-S02 crmd[12446]:  notice: Operation ping_start_0: ok 
(node=MDA1PFP-PCS02, call=11, rc=0, cib-update=15, confirmed=true)
Sep 19 10:03:48 MDA1PFP-S02 corosync[12425]: [TOTEM ] Marking ringid 1 
interface 192.168.120.11 FAULTY

On the initially active node hosting the VIP, the log is:
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 
fe80::5eb9:1ff:fe9c:e7fc on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 
192.168.120.20 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing address record for 
192.168.120.10 on bond0.
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service 
for bond0.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave 
eno49 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): released from 
master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (bond0): bond slave 
eno50 was released
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): released from 
master bond0
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno50): link 
disconnected
Sep 19 10:03:45 MDA1PFP-S01 gnome-session: Gjs-Message: JS LOG: Removing a 
network device that was not added
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service 
for eno50.
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn>  (eno50): failed to 
disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 avahi-daemon[912]: Withdrawing workstation service 
for eno49.
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.1: complete
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <info>  (eno49): device state 
change: disconnected -> unmanaged (reason 'removed') [30 10 36]
Sep 19 10:03:45 MDA1PFP-S01 NetworkManager[881]: <warn>  (eno49): failed to 
disable userspace IPv6LL address handling
Sep 19 10:03:45 MDA1PFP-S01 kernel: ixgbe 0000:04:00.0: complete
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: Unknown interface 
[bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8714]: ERROR: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: 
mda-ip_monitor_1000:8714:stderr [ ocf-exit-reason:Unknown interface [bond0] No 
such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: 
MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface 
[bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
Ignore
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: Stop    mda-ip      
(MDA1PFP-PCS01)
Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 3: 
/var/lib/pacemaker/pengine/pe-input-501.bz2
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop 
mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: 
MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface 
[bond0] No such device.\n ]
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
[bond0] No such device.
Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ 
ocf-exit-reason:Unknown interface [bond0] No such device. ]
Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok 
(node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
Ignore
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 4: 
/var/lib/pacemaker/pengine/pe-input-502.bz2
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 4 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-502.bz2): Complete
Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #21 bond0, 
192.168.120.20#123, interface stats: received=0, sent=0, dropped=0, 
active_time=12 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #19 bond0, 
fe80::5eb9:1ff:fe9c:e7fc#123, interface stats: received=0, sent=0, dropped=0, 
active_time=218 secs
Sep 19 10:03:46 MDA1PFP-S01 ntpd[24456]: Deleting interface #18 bond0, 
192.168.120.10#123, interface stats: received=0, sent=0, dropped=0, 
active_time=218 secs
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a0
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a3 a5
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5 a7
Sep 19 10:03:47 MDA1PFP-S01 corosync[7776]: [TOTEM ] Retransmit List: a5
Sep 19 10:03:48 MDA1PFP-S01 corosync[7776]: [TOTEM ] Marking ringid 1 interface 
192.168.120.10 FAULTY
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
Ignore
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
re-starting anywhere: operation monitor failed 'not configured' (6)
Sep 19 10:03:54 MDA1PFP-S01 pengine[7796]:  notice: Calculated Transition 5: 
/var/lib/pacemaker/pengine/pe-input-503.bz2
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: Transition 5 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-503.bz2): Complete
Sep 19 10:03:54 MDA1PFP-S01 crmd[7797]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
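
The decisive pengine message in a log like the one above can be pulled out
mechanically. A minimal sketch, assuming the message wording shown in this log
(other Pacemaker versions may phrase it differently) and unwrapped syslog
lines; fatal_bans is a hypothetical function name.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and print
# "<resource> <operation> <rc>" once for each distinct fatal-ban message.
# The pattern is taken from the pengine lines quoted in this thread.
fatal_bans() {
    sed -n "s/.*Preventing \([^ ]*\) from re-starting anywhere: operation \([^ ]*\) failed '[^']*' (\([0-9]*\)).*/\1 \2 \3/p" \
        | sort -u
}

# Example with one line from the log above (joined onto a single line).
printf '%s\n' \
    "Sep 19 10:03:45 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6)" \
    | fatal_bans
```

On a real node the input would be the system log file, whose location depends
on the distribution. After the underlying problem is fixed, the recorded
failure can be cleared with "pcs resource cleanup mda-ip" so that the policy
engine reconsiders placement.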

Best wishes,
  Jens



