Public bug reported: Hi,
Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and no l3ha, and use ovs with GRE tunnels. The cloud has around 30 compute nodes (mostly arm64).

Last week, ovs got restarted during a package upgrade:

2018-03-21 17:17:25 upgrade openvswitch-common:arm64 2.5.2-0ubuntu0.16.04.3 2.5.4-0ubuntu0.16.04.1

This led to instances on 2 arm64 compute nodes losing networking completely. Upon closer inspection, I realized that a flow was missing in br-tun table 3: https://pastebin.ubuntu.com/p/VXRJJX8J3k/

I believe this is due to a race in ovs_neutron_agent.py. These flows in table 3 are set up in provision_local_vlan():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L675
which is called by port_bound():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L789-L791
which is called by treat_vif_port():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1405-L1410
which is called by treat_devices_added_or_updated():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1517-L1525
which is called by process_network_ports():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1618-L1623
which is called by the big rpc_loop():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2023-L2029

So how does the agent know when to create these table 3 flows?
Well, in rpc_loop(), it checks for OVS restarts (https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1947-L1948). If OVS did restart, it does some basic ovs setup (default flows, etc.) and, very importantly for what follows, restarts the OVS polling manager. Later (still in rpc_loop()), it sets "ovs_restarted" to True and processes the ports as usual.

The expected behaviour here is that, since the polling manager got restarted, any port that is up will be marked as "added" and processed as such in port_bound() (see call stack above). If this function is called on a port while ovs_restarted is True, then provision_local_vlan() will get called and will set up the table 3 flows.

This all works under the assumption that the polling manager (which is an async process) raises the "I got new port!" event before rpc_loop() checks for it (in process_port_events(), called by process_port_info()). However, if for example the node is under load, this may not always be the case. What happens then is that the rpc_loop iteration in which OVS is detected as restarted doesn't see any change on the ports, and so does nothing. The next iteration of rpc_loop will process the "I got new port!" events, but that iteration will not be running with ovs_restarted set to True, so the ports won't be brought up properly. More specifically, the table 3 flows in br-tun will be missing.

This is shown in the debug logs: https://pastebin.ubuntu.com/p/M8yYn3YnQ6/ - you can see that the iteration in which "OVS is restarted" is detected (loop iteration 320773) doesn't process any port ("iteration:320773 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}."), but the next iteration does process 3 "added" ports. You can see that "output received" is logged in the first loop, 49ms after "starting polling" is logged, which is presumably the problem.
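
For illustration, the race described above can be reduced to a small toy model. This is only a sketch and not neutron code: ToyAgent, rpc_loop_iteration(), simulate() and the port names are hypothetical, and the model assumes (as a simplification of the behaviour described above) that table 3 flows are only re-created when a port is processed in the same iteration that saw the OVS restart.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for the OVS agent; not the neutron implementation."""
    pending_events: list = field(default_factory=list)  # ports reported by the async poller
    tunnel_flows: set = field(default_factory=set)      # ports whose table 3 flows exist

    def rpc_loop_iteration(self, ovs_restarted):
        # Drain whatever the asynchronous poller has reported so far.
        added, self.pending_events = self.pending_events, []
        for port in added:
            # Simplifying assumption: flows are only re-provisioned when the
            # port is handled while ovs_restarted is True.
            if ovs_restarted:
                self.tunnel_flows.add(port)
        return len(added)


def simulate(events_arrive_before_check):
    """Run the restart iteration and the one after it.

    Returns (ports processed in the restart iteration,
             ports processed in the next iteration,
             ports that ended up with table 3 flows).
    """
    agent = ToyAgent()
    ports = ["port-a", "port-b", "port-c"]
    if events_arrive_before_check:
        agent.pending_events.extend(ports)   # poller output arrives in time
    first = agent.rpc_loop_iteration(ovs_restarted=True)
    if not events_arrive_before_check:
        agent.pending_events.extend(ports)   # poller output arrives too late
    second = agent.rpc_loop_iteration(ovs_restarted=False)
    return first, second, sorted(agent.tunnel_flows)
```

In this model, simulate(True) corresponds to a healthy node (all three ports are processed in the restart iteration and get their flows), while simulate(False) reproduces the pattern in the debug logs: 0 ports processed in the restart iteration, 3 "added" ports in the next one, and no table 3 flows.
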
On all the non-failing nodes, the output is received before "starting polling" is logged.

I believe the proper thing to do is to set "sync" to True (in rpc_loop()) if an ovs restart is detected, forcing process_port_info() to not rely on async events and to scan the ports itself using scan_ports().

Thanks

** Affects: neutron (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1758868

Title:
  ovs restart can lead to critical ovs flows missing