Public bug reported:

Hi,

Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and
no l3ha. Using ovs with GRE tunnels.

The cloud has around 30 compute nodes (mostly arm64). Last week, ovs got
restarted during a package upgrade :

2018-03-21 17:17:25 upgrade openvswitch-common:arm64
2.5.2-0ubuntu0.16.04.3 2.5.4-0ubuntu0.16.04.1

This led to instances on 2 arm64 compute nodes losing networking
completely. Upon closer inspection, I realized that a flow was missing
in br-tun table 3 : https://pastebin.ubuntu.com/p/VXRJJX8J3k/

I believe this is due to a race in ovs_neutron_agent.py. These flows in
table 3 are set up in provision_local_vlan() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L675

which is called by port_bound() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L789-L791

which is called by treat_vif_port() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1405-L1410

which is called by treat_devices_added_or_updated() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1517-L1525

which is called by process_network_ports() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1618-L1623

which is called by the big rpc_loop() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2023-L2029


So how does the agent know when to create these table 3 flows ? Well, in
rpc_loop(), it checks for OVS restarts
(https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1947-L1948).
If OVS did restart, it redoes some basic ovs setup (default flows, etc.) and
(very important for later) it restarts the OVS polling manager.

Later (still in rpc_loop()), it sets "ovs_restarted" to True, and
processes the ports as usual. The expected behaviour here is that, since
the polling manager got restarted, any port that is up will be marked as
"added" and processed as such, in port_bound() (see call stack above). If
this function is called on a port while ovs_restarted is True, then
provision_local_vlan() will get called and will set up the table 3 flows.
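
To make that condition concrete, here is a toy model of it. This is not
the agent's code: ToyAgent, table3_flows and _next_vlan are made-up names,
and the check in port_bound() is only my simplified reading of the lines
linked above. It shows that for a network the agent already knows about,
the table 3 flow only comes back when ovs_restarted is True.

```python
class ToyAgent:
    """Toy stand-in for the OVS agent; NOT the real neutron class."""

    def __init__(self):
        self.local_vlan_map = {}    # net_uuid -> local VLAN id (agent memory)
        self.table3_flows = set()   # net_uuids that have a br-tun table 3 flow
        self._next_vlan = 1         # hypothetical VLAN allocator

    def provision_local_vlan(self, net_uuid):
        # Allocate a local VLAN once, then (re)install the table 3 flow.
        if net_uuid not in self.local_vlan_map:
            self.local_vlan_map[net_uuid] = self._next_vlan
            self._next_vlan += 1
        self.table3_flows.add(net_uuid)

    def port_bound(self, net_uuid, ovs_restarted):
        # Flows are only (re)installed for an unknown network or after a restart.
        if net_uuid not in self.local_vlan_map or ovs_restarted:
            self.provision_local_vlan(net_uuid)


agent = ToyAgent()
agent.port_bound("net-a", ovs_restarted=False)   # first port: flow installed
agent.table3_flows.clear()                       # OVS restart wipes the flows

agent.port_bound("net-a", ovs_restarted=False)
flow_without_flag = "net-a" in agent.table3_flows   # network known, flag unset

agent.port_bound("net-a", ovs_restarted=True)
flow_with_flag = "net-a" in agent.table3_flows      # flag set: flow re-created
```

In this toy, flow_without_flag ends up False and flow_with_flag ends up
True, which is why the value of ovs_restarted at the time the port is
processed matters so much.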

This all works under the assumption that the polling manager (which is
an async process) will raise the "I got new port !" event before
rpc_loop() checks for events (in process_port_events(), called by
process_port_info()). However, if for example the node is under load,
this may not always be the case.

What happens then is that the rpc_loop iteration in which OVS is detected
as restarted doesn't see any change on the ports, and so does nothing. The
next iteration will process the "I got new port !" events, but it will not
be running with ovs_restarted set to True, so the ports won't be brought
up properly: more specifically, the table 3 flows in br-tun will be
missing. This is shown in the debug logs :
https://pastebin.ubuntu.com/p/M8yYn3YnQ6/ - you can see that the loop in
which "OVS is restarted" is detected (loop iteration 320773) doesn't
process any port ("iteration:320773 completed. Processed ports
statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}"), but
the next iteration does process 3 "added" ports. You can also see that
the "output received" message is logged in the first loop, 49ms after
"starting polling" is logged, which is presumably the problem. On all the
non-failing nodes, the output is received before "starting polling".
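
The ordering problem can be reproduced with a small deterministic model.
Again, this is only a sketch under my own assumptions: rpc_loop_iteration()
and simulate() are hypothetical helpers, the queue stands in for the
polling manager's events, and "restored" stands in for ports whose table 3
flow gets re-created.

```python
def rpc_loop_iteration(pending_events, ovs_restarted):
    """One simplified loop pass over whatever events have arrived so far."""
    added = list(pending_events)
    pending_events.clear()
    # A known network only gets its table 3 flow back when ovs_restarted is True.
    restored = [port for port in added if ovs_restarted]
    return {"added": len(added), "restored": restored}


def simulate(events_arrive_in_time):
    """Run the restart iteration and the one after it."""
    ports = ["port-1", "port-2", "port-3"]
    queue = []
    if events_arrive_in_time:
        queue.extend(ports)              # events are there before the check
    first = rpc_loop_iteration(queue, ovs_restarted=True)
    if not events_arrive_in_time:
        queue.extend(ports)              # events show up just after the check
    second = rpc_loop_iteration(queue, ovs_restarted=False)
    return first, second


good = simulate(events_arrive_in_time=True)
bad = simulate(events_arrive_in_time=False)
```

In the "bad" run the first iteration reports 0 added ports and the second
reports 3 added ports with nothing restored, which matches the shape of
what the debug logs show on the failing nodes.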

I believe the proper thing to do is to set "sync" to True (in
rpc_loop()) if an ovs restart is detected, forcing process_port_info()
to ignore the async events and scan the ports itself using scan_ports().
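
A rough sketch of the idea, not a patch: get_added_ports() and
restart_iteration() are hypothetical helpers, and scan_ports here is just
a callable returning the ports present on the bridge, not the agent's real
method signature. The point is only that a forced scan does not depend on
when the async events arrive.

```python
def get_added_ports(sync, queued_events, scan_ports):
    """Pick the source of 'added' ports for one loop pass."""
    if sync:
        # Full scan: independent of the async event timing.
        return set(scan_ports())
    added = set(queued_events)
    queued_events.clear()
    return added


def restart_iteration(force_sync_on_restart, queued_events, scan_ports):
    """The iteration in which the OVS restart is detected."""
    ovs_restarted = True
    sync = force_sync_on_restart         # proposed change: sync on restart
    added = get_added_ports(sync, queued_events, scan_ports)
    # Ports processed in this pass get their flows re-created.
    return {port for port in added if ovs_restarted}


bridge_ports = ["port-1", "port-2", "port-3"]
late_queue = []                          # async events have not arrived yet

without_fix = restart_iteration(False, late_queue, lambda: bridge_ports)
with_fix = restart_iteration(True, late_queue, lambda: bridge_ports)
```

With the late (empty) event queue, without_fix is empty while with_fix
contains all three ports, so they are processed in the same iteration that
has ovs_restarted set to True.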

Thanks

** Affects: neutron (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1758868

Title:
  ovs restart can lead to critical ovs flows missing
