** Description changed:

[Impact]

During scalability tests where extreme load is generated by creating thousands of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be pinged or reached over ssh after deployment. The ovnmeta namespaces for the networks that the VMs were created in are missing.

The following lines are present in neutron-ovn-metadata-agent.log:

2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494

What is happening is that, under extreme load, the metadata port information has sometimes not yet been propagated by OVN to the Southbound database, which usually takes the form of an update notification. When the PortBindingChassisEvent event is triggered in ovn-metadata-agent, it only looks for update notifications, finds none, so it does not know any metadata port or IP information; it fails, logs the message above, and tears down the ovnmeta namespace for that VM. Eventually ovsdb-server catches up, merges the insert and update notifications, and sends them out as a single insert notification, which PortBindingChassisEvent currently ignores, so the metadata is never applied to the VM.

This is a race condition; it does not happen under normal conditions, as the metadata would simply be delivered via an update notification.

The fix is to also listen for insert notifications, and act on them.
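For illustration only, the sketch below shows the shape of that fix: an ovsdbapp-based row event that matches both "create" and "update" notifications on the Southbound Port_Binding table. This is not the backported patch itself; the class name, the provision_datapath() hook and the agent attributes are assumed placeholders for whatever the real handler in neutron/agent/ovn/metadata/agent.py uses.

    # Minimal sketch (assumed names): watch both create and update
    # notifications so the agent still provisions the namespace when
    # ovsdb-server coalesces the insert and update into one insert.
    from ovsdbapp.backend.ovs_idl import event as row_event


    class PortBindingCreatedOrUpdated(row_event.RowEvent):
        def __init__(self, agent):
            self.agent = agent
            events = (self.ROW_CREATE, self.ROW_UPDATE)
            super().__init__(events, 'Port_Binding', None)
            self.event_name = self.__class__.__name__

        def match_fn(self, event, row, old):
            # Only act on ports bound to the chassis this agent runs on.
            return bool(row.chassis) and row.chassis[0].name == self.agent.chassis

        def run(self, event, row, old):
            # Placeholder hook: the real agent provisions the ovnmeta
            # namespace for row.datapath here.
            self.agent.provision_datapath(row)

In other words, the provisioning logic itself does not change; only the set of watched notification types grows.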
[Test Case]

This can't be reproduced in the lab, even after many attempts.

A user sees this issue daily in production, where they run a scalability test every night, in which they create a new tenant, create all necessary resources (networks, subnets, routers, load balancers, etc.) and start several thousand VMs. They then audit the deployment and verify that everything deployed correctly.

Most days there are a small number of VMs that are unreachable, and those VMs have the following messages in neutron-ovn-metadata-agent.log:

2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494

There are test packages available in:
https://launchpad.net/~mruffell/+archive/ubuntu/sf375454-updates

Some previous test packages have been running in the user's test environment for several months, with zero metadata namespace issues since rollout. We issued the user a hotfix and it has been running in production for the past month; they have also had zero metadata namespace issues since rollout.

When this enters -proposed, it will be verified in the user's production environment and subjected to their nightly scalability tests, with the results collected after a week or so of runs. After that we should be confident the -proposed packages fix the issue.

Additionally, runs will be done with charmed-openstack-tester between -updates and -proposed to see if there are any differences in test execution.

[Where problems could occur]

We are changing ovn-metadata-agent in neutron, and any issues would be limited to ovn-metadata-agent only.

ovn-metadata-agent will now listen for both insert and update notifications from ovsdb-server, instead of only update notifications as before. It shouldn't impact any existing functionality.

If a regression were to occur, it would affect attaching metadata namespaces to newly created VMs, which prevents them from getting their initial metadata URL / DHCP lease / IP address information, causing connectivity issues for newly created VMs. It shouldn't impact any existing VMs.

There are no workarounds if a regression were to occur, other than to downgrade the package.

[Other info]

This was fixed upstream by:

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
From: Terry Wilson <[email protected]>
Date: Fri, 15 Dec 2023 21:00:43 +0000
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682

This patch landed in Caracal. It applies to Zed, Antelope and Bobcat, but it depends on the following commit:

commit 6801589510242affc78497660d34377603774074
From: Jakub Libosvar <[email protected]>
Date: Thu, 21 Sep 2023 19:40:36 +0000
Subject: ovn-metadata: Refactor events
Link: https://opendev.org/openstack/neutron/commit/6801589510242affc78497660d34377603774074

After some discussion, we (mruffell, brian-haley, hopem) decided that it would be too much of a regression risk to backport "ovn-metadata: Refactor events" to Zed, Antelope and Bobcat, so we marked those as "Won't fix".

The user is on Yoga, so Brian Haley wrote a new backport that does not depend on "ovn-metadata: Refactor events", which is the following commit in neutron yoga:

commit 952e960414e7c15d4d4351bf2300ce53a69e4051
From: Terry Wilson <[email protected]>
Date: Tue, 20 Aug 2024 10:20:52 -0500
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051

This is what we are suggesting for SRU to jammy / yoga.

There is a low chance of an upgrade regression for users going from yoga -> zed -> antelope -> bobcat -> caracal (fixed), since users are unlikely to run heavy stress tests partway through a series upgrade, and would more likely run them once they land on Caracal.

If we have to, we will consider Zed, Antelope and Bobcat in the future, but for now, just Yoga only.

== ORIGINAL DESCRIPTION ==

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:

udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed

And the ovnmeta- namespaces for the networks that the VMs were booting from were missing.
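(As an illustrative diagnostic, not part of the original report: namespaces created by the agent typically appear under /run/netns as ovnmeta-<network-uuid>, so the missing namespace can be confirmed on the compute node hosting the VM. The network UUID below is a placeholder taken from the log further down.)

    # Check whether the ovnmeta namespace for a given network exists on this node.
    import os

    net_id = "9029c393-5c40-4bf2-beec-27413417eafa"  # placeholder network UUID
    ns_path = f"/run/netns/ovnmeta-{net_id}"
    print(f"{ns_path}: {'present' if os.path.exists(ns_path) else 'MISSING'}")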
Looking into the ovn-metadata-agent.log:

2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

Apparently, when the system is under stress (scalability tests) there are some edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database. When the PortBindingChassisEvent event is handled and tries to find either the metadata port or the IP information on it (which is updated by ML2/OVN during subnet creation), it cannot be found, and the agent fails silently with the error shown above.

Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.
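(A hedged example of spotting affected networks after a run by scanning for that message; the log path is assumed from the usual Ubuntu neutron packaging and may differ per deployment.)

    # Collect network UUIDs whose ovnmeta namespace was torn down,
    # according to the agent log. Log path assumed, adjust as needed.
    import re

    pattern = re.compile(r"There is no metadata port for network (\S+)")
    with open("/var/log/neutron/neutron-ovn-metadata-agent.log") as log:
        affected = {m.group(1) for line in log if (m := pattern.search(line))}
    print("networks with torn-down ovnmeta namespaces:", affected or "none")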
--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2017748

Title:
  [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions
