Hi, with the latest upload of openvswitch v2.15.0+ds1-2+deb11u2 to bullseye-security[1] a customer of mine, using a Proxmox Ceph cluster (PVE 7.3) with openvswitch, got hit by a serious networking issue, which matches this bug report.
It turned out that a service restart of the openvswitch-switch service (actually its ovs-vswitchd.service startup underneath) breaks the corresponding network interface by leaving it in state DOWN and without IP address. Given that during package installation/upgrade of the openvswitch-switch package the debhelper magic in postinst automatically kicks in, such a package upgrade can break a whole cluster network. The problem was perfectly reproducible on our AMD CPU powered cluster itself (the "fix" after the openvswitch-switch/ ovs-vswitchd.service restart was a manual startup up `ifup vmbr1`, though still breaking the Ceph networking for some time therefore, so not really pleasant to debug there). At first the issue was *not* yet reproducible inside VirtualBox VMs running on top of a Intel CPU (hosts running 5.10 kernels and VirtualBox 6.x). On a VirtualBox VM running on top of a AMD CPU (host running kernel 6.1.7 with VirtualBox 7.0.4) the problem could be reproduced though. But the problem was present iff the VM used more than one CPU. We then could also reproduce the problem inside a VM running on top of our PVE cluster (with CPU types "host" as well as "kvm64" but even "qemu64"), iff more than one CPU core was assigned. But we could *not* reproduce the problem on a PVE cluster running on top of Intel CPUs. Anyways, the bug clearly didn't show up on *any* single-CPU machine, though could easily be reproduced on systems with host AMD CPU and if the VM used >=2 CPUs. With a joint debugging effort of Florian Apolloner, Darshaka Pathirana and myself we managed to track down the bug to https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86 This commit was part of the v2.15.1 release and hides itself behind the "Bug fixes" entry of https://www.openvswitch.org/releases/NEWS-2.15.1.txt It looks like the connection state machine which got introduced by the IDL split into two layers (see upstream git commit 1c337c43ac1c876) seems to have a concurrency/race condition that can show up on certain multi-CPU systems. Good news is, that with aforesaid commit applied this problem no longer shows up. Attached is the ready-for-use patch in DEB-3 format (fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch) for usage via debian/patches/series on top of the current openvswitch packaging of v2.15.0+ds1-2+deb11u2. IMO this is a serious bug of the package present in bullseye/bullseye-security which should be fixed through a package update in proposed-updates and therefore the next Debian point release of bullseye. Therefore I'm also attaching a debdiff (openvswitch_v2.15.0+ds1-2+deb11u3.debdiff) between 2.15.0+ds1-2+deb11u2 (current version in bullseye-security) and a new package version meant for inclusion in bullseye (feel free to adjust the package version if needed). Thomas, could you please take care of the corresponding upload towards bullseye? [1] https://tracker.debian.org/news/1408998/accepted-openvswitch-2150ds1-2deb11u2-source-into-stable-security/ regards -mika-
From: Ilya Maximets <i.maxim...@ovn.org> Subject: ovsdb-idl: Fix the database update signaling if it has never been connected The symptom of this issue is that OVS bridge looses its IP address on restart. . Simple reproducer: 0. start ovsdb-server and ovs-vswitchd 1. ovs-vsctl add-br br0 2. ifconfig br0 10.0.0.1 up 3. ovs-appctl -t ovs-vswitchd exit 4. start ovs-vswitchd back. . After step #3 ovs-vswitchd is down, but br0 interface exists and has configured IP address. After step #4 there is no IP address on the port br0. . What happened: 1. ovsdb-cs connects to the database via ovsdb-idl and requests database lock. --> get_schema for _Server database --> lock request . 2. ovsdb-cs receives schema for the _Server database. And sends monitor request. <-- schema for _Server --> monitor_cond for _Server . 3. ovsdb-cs receives lock reply. <-- locked At this point ovsdb-cs generates OVSDB_CS_EVENT_TYPE_LOCKED event and passes it to ovsdb-idl. ovsdb-idl increases change_seqno. . 4. ovsdb_idl_has_ever_connected() is 'true' now, because change_seqno is not zero. . 5. ovs-vswitchd decides that it has connection with database and all the initial data, therefore initiates configuration of bridges. bridge_run():ovsdb_idl_has_ever_connected() --> true . 6. Since monitor request for the Open_vSwitch database is not even sent yet, the database is empty. This leads to removal of all the ports and all other resources. . 7. When data finally received, ovs-vswitchd re-creates bridges and ports, but IP addresses can not be restored. . While splitting out ovsdb-cs from ovsdb-idl one part of the logic was lost. Particularly, before the split, ovsdb-idl updated change_seqno only in MONITORING state. . Restoring the logic by updating the change_seqno only if may send transaction, i.e. lock is ours and ovsdb-cs is in the MONITORING state. This matches with the main purpose of increasing change_seqno at this point, i.e. to force the client to re-try the transaction. With this change ovsdb_idl_has_ever_connected() remains 'false' until the first monitor reply with the actual data received. . This issue was reported several times during the last couple of weeks. . Reported-at: https://bugzilla.redhat.com/1968445 Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383512.html Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html Fixes: 1c337c43ac1c ("ovsdb-idl: Break into two layers.") Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> Acked-by: Dumitru Ceara <dce...@redhat.com> Author: Ilya Maximets <i.maxim...@ovn.org> Date: Tue, 8 Jun 2021 15:17:23 +0200 Origin: upstream, https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86.patch Applied-Upstream: 2.15.1 Bug-Debian: https://bugs.debian.org/1008684 Last-Update: 2023-01-26 --- lib/ovsdb-idl.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git lib/ovsdb-idl.c lib/ovsdb-idl.c index 2c8a0c9..6241fb4 100644 --- a/lib/ovsdb-idl.c +++ b/lib/ovsdb-idl.c @@ -401,9 +401,15 @@ ovsdb_idl_run(struct ovsdb_idl *idl) break; case OVSDB_CS_EVENT_TYPE_LOCKED: - /* If the client couldn't run a transaction because it didn't have - * the lock, this will encourage it to try again. */ - idl->change_seqno++; + if (ovsdb_cs_may_send_transaction(idl->cs)) { + /* If the client couldn't run a transaction because it didn't + * have the lock, this will encourage it to try again. */ + idl->change_seqno++; + } else { + /* We're setting up a session, so don't signal that the + * database changed. Finalizing the session will increment + * change_seqno anyhow. */ + } break; case OVSDB_CS_EVENT_TYPE_UPDATE: -- 2.30.2
diff -Nru openvswitch-2.15.0+ds1/debian/changelog openvswitch-2.15.0+ds1/debian/changelog --- openvswitch-2.15.0+ds1/debian/changelog 2022-10-03 12:59:27.000000000 +0200 +++ openvswitch-2.15.0+ds1/debian/changelog 2023-01-26 17:17:41.000000000 +0100 @@ -1,3 +1,12 @@ +openvswitch (2.15.0+ds1-2+deb11u3) bullseye; urgency=medium + + * Non-maintainer upload. + * Apply upstream fix + https://github.com/openvswitch/ovs/commit/bc0aa785a83c1 + (Closes: #1008684) + + -- Michael Prokop <m...@debian.org> Thu, 26 Jan 2023 17:17:41 +0100 + openvswitch (2.15.0+ds1-2+deb11u2) bullseye-security; urgency=medium * Fix ovs-dpctl-top by removing 3 wrong hunks in py3-compat.patch. diff -Nru openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch --- openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch 1970-01-01 01:00:00.000000000 +0100 +++ openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch 2023-01-26 17:17:30.000000000 +0100 @@ -0,0 +1,102 @@ +From: Ilya Maximets <i.maxim...@ovn.org> +Subject: ovsdb-idl: Fix the database update signaling if it has never been connected + The symptom of this issue is that OVS bridge looses its IP address on + restart. + . + Simple reproducer: + 0. start ovsdb-server and ovs-vswitchd + 1. ovs-vsctl add-br br0 + 2. ifconfig br0 10.0.0.1 up + 3. ovs-appctl -t ovs-vswitchd exit + 4. start ovs-vswitchd back. + . + After step #3 ovs-vswitchd is down, but br0 interface exists and + has configured IP address. After step #4 there is no IP address + on the port br0. + . + What happened: + 1. ovsdb-cs connects to the database via ovsdb-idl and requests + database lock. + --> get_schema for _Server database + --> lock request + . + 2. ovsdb-cs receives schema for the _Server database. And sends + monitor request. + <-- schema for _Server + --> monitor_cond for _Server + . + 3. ovsdb-cs receives lock reply. + <-- locked + At this point ovsdb-cs generates OVSDB_CS_EVENT_TYPE_LOCKED + event and passes it to ovsdb-idl. ovsdb-idl increases change_seqno. + . + 4. ovsdb_idl_has_ever_connected() is 'true' now, because change_seqno + is not zero. + . + 5. ovs-vswitchd decides that it has connection with database and + all the initial data, therefore initiates configuration of bridges. + bridge_run():ovsdb_idl_has_ever_connected() --> true + . + 6. Since monitor request for the Open_vSwitch database is not even + sent yet, the database is empty. This leads to removal of all the + ports and all other resources. + . + 7. When data finally received, ovs-vswitchd re-creates bridges and + ports, but IP addresses can not be restored. + . + While splitting out ovsdb-cs from ovsdb-idl one part of the logic + was lost. Particularly, before the split, ovsdb-idl updated + change_seqno only in MONITORING state. + . + Restoring the logic by updating the change_seqno only if may send + transaction, i.e. lock is ours and ovsdb-cs is in the MONITORING + state. This matches with the main purpose of increasing change_seqno + at this point, i.e. to force the client to re-try the transaction. + With this change ovsdb_idl_has_ever_connected() remains 'false' + until the first monitor reply with the actual data received. + . + This issue was reported several times during the last couple of weeks. + . + Reported-at: https://bugzilla.redhat.com/1968445 + Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383512.html + Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html + Fixes: 1c337c43ac1c ("ovsdb-idl: Break into two layers.") + Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> + Acked-by: Dumitru Ceara <dce...@redhat.com> + Author: Ilya Maximets <i.maxim...@ovn.org> + Date: Tue, 8 Jun 2021 15:17:23 +0200 +Origin: upstream, https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86.patch +Applied-Upstream: 2.15.1 +Bug-Debian: https://bugs.debian.org/1008684 +Last-Update: 2023-01-26 + +--- + lib/ovsdb-idl.c | 12 +++++++++--- + 1 file changed, 9 insertions(+), 3 deletions(-) + +diff --git lib/ovsdb-idl.c lib/ovsdb-idl.c +index 2c8a0c9..6241fb4 100644 +--- a/lib/ovsdb-idl.c ++++ b/lib/ovsdb-idl.c +@@ -401,9 +401,15 @@ ovsdb_idl_run(struct ovsdb_idl *idl) + break; + + case OVSDB_CS_EVENT_TYPE_LOCKED: +- /* If the client couldn't run a transaction because it didn't have +- * the lock, this will encourage it to try again. */ +- idl->change_seqno++; ++ if (ovsdb_cs_may_send_transaction(idl->cs)) { ++ /* If the client couldn't run a transaction because it didn't ++ * have the lock, this will encourage it to try again. */ ++ idl->change_seqno++; ++ } else { ++ /* We're setting up a session, so don't signal that the ++ * database changed. Finalizing the session will increment ++ * change_seqno anyhow. */ ++ } + break; + + case OVSDB_CS_EVENT_TYPE_UPDATE: +-- +2.30.2 + diff -Nru openvswitch-2.15.0+ds1/debian/patches/series openvswitch-2.15.0+ds1/debian/patches/series --- openvswitch-2.15.0+ds1/debian/patches/series 2022-10-03 12:59:27.000000000 +0200 +++ openvswitch-2.15.0+ds1/debian/patches/series 2023-01-26 17:17:38.000000000 +0100 @@ -3,3 +3,4 @@ CVE-2021-36980_Fix_use-after-free_while_decoding_RAW_ENCAP.patch CVE-2022-4337and8_1_fix_bugs_when_parsing_malformed_LLDP_packets.patch CVE-2022-4337and8_2_Add_a_unit_test_for_LLDP.patch +fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch
signature.asc
Description: PGP signature