Hi,

with the latest upload of openvswitch v2.15.0+ds1-2+deb11u2 to
bullseye-security[1] a customer of mine, using a Proxmox Ceph cluster
(PVE 7.3) with openvswitch, got hit by a serious networking issue,
which matches this bug report.

It turned out that a service restart of the openvswitch-switch
service (actually its ovs-vswitchd.service startup underneath)
breaks the corresponding network interface by leaving it in state
DOWN and without IP address. Given that during package
installation/upgrade of the openvswitch-switch package the debhelper
magic in postinst automatically kicks in, such a package upgrade can
break a whole cluster network.

The problem was perfectly reproducible on our AMD CPU powered
cluster itself (the "fix" after the openvswitch-switch/
ovs-vswitchd.service restart was a manual startup up `ifup vmbr1`,
though still breaking the Ceph networking for some time therefore,
so not really pleasant to debug there).

At first the issue was *not* yet reproducible inside VirtualBox VMs
running on top of a Intel CPU (hosts running 5.10 kernels and
VirtualBox 6.x). On a VirtualBox VM running on top of a AMD CPU
(host running kernel 6.1.7 with VirtualBox 7.0.4) the problem could
be reproduced though. But the problem was present iff the VM used
more than one CPU.

We then could also reproduce the problem inside a VM running on top
of our PVE cluster (with CPU types "host" as well as "kvm64" but
even "qemu64"), iff more than one CPU core was assigned. But we
could *not* reproduce the problem on a PVE cluster running on top of
Intel CPUs. Anyways, the bug clearly didn't show up on *any*
single-CPU machine, though could easily be reproduced on systems
with host AMD CPU and if the VM used >=2 CPUs.

With a joint debugging effort of Florian Apolloner, Darshaka Pathirana
and myself we managed to track down the bug to
https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86

This commit was part of the v2.15.1 release and hides itself behind the
"Bug fixes" entry of https://www.openvswitch.org/releases/NEWS-2.15.1.txt

It looks like the connection state machine which got introduced by
the IDL split into two layers (see upstream git commit
1c337c43ac1c876) seems to have a concurrency/race condition that can
show up on certain multi-CPU systems.

Good news is, that with aforesaid commit applied this problem no
longer shows up. Attached is the ready-for-use patch in DEB-3 format
(fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch)
for usage via debian/patches/series on top of the current openvswitch
packaging of v2.15.0+ds1-2+deb11u2.

IMO this is a serious bug of the package present in
bullseye/bullseye-security which should be fixed through a package
update in proposed-updates and therefore the next Debian point
release of bullseye.

Therefore I'm also attaching a debdiff
(openvswitch_v2.15.0+ds1-2+deb11u3.debdiff) between
2.15.0+ds1-2+deb11u2 (current version in bullseye-security) and a
new package version meant for inclusion in bullseye (feel free to
adjust the package version if needed).

Thomas, could you please take care of the corresponding upload
towards bullseye?

[1] 
https://tracker.debian.org/news/1408998/accepted-openvswitch-2150ds1-2deb11u2-source-into-stable-security/

regards
-mika-
From: Ilya Maximets <i.maxim...@ovn.org>
Subject: ovsdb-idl: Fix the database update signaling if it has never been connected
  The symptom of this issue is that OVS bridge looses its IP address on
  restart.
  .
  Simple reproducer:
   0. start ovsdb-server and ovs-vswitchd
   1. ovs-vsctl add-br br0
   2. ifconfig br0 10.0.0.1 up
   3. ovs-appctl -t ovs-vswitchd exit
   4. start ovs-vswitchd back.
  .
  After step #3 ovs-vswitchd is down, but br0 interface exists and
  has configured IP address.  After step #4 there is no IP address
  on the port br0.
  .
  What happened:
  1. ovsdb-cs connects to the database via ovsdb-idl and requests
     database lock.
     --> get_schema for _Server database
     --> lock request
  .
  2. ovsdb-cs receives schema for the _Server database.  And sends
     monitor request.
     <-- schema for _Server
     --> monitor_cond for _Server
  .
  3. ovsdb-cs receives lock reply.
     <-- locked
     At this point ovsdb-cs generates OVSDB_CS_EVENT_TYPE_LOCKED
     event and passes it to ovsdb-idl.  ovsdb-idl increases change_seqno.
  .
  4. ovsdb_idl_has_ever_connected() is 'true' now, because change_seqno
     is not zero.
  .
  5. ovs-vswitchd decides that it has connection with database and
     all the initial data, therefore initiates configuration of bridges.
     bridge_run():ovsdb_idl_has_ever_connected() --> true
  .
  6. Since monitor request for the Open_vSwitch database is not even
     sent yet, the database is empty.  This leads to removal of all the
     ports and all other resources.
  .
  7. When data finally received, ovs-vswitchd re-creates bridges and
     ports, but IP addresses can not be restored.
  .
  While splitting out ovsdb-cs from ovsdb-idl one part of the logic
  was lost.  Particularly, before the split, ovsdb-idl updated
  change_seqno only in MONITORING state.
  .
  Restoring the logic by updating the change_seqno only if may send
  transaction, i.e. lock is ours and ovsdb-cs is in the MONITORING
  state.  This matches with the main purpose of increasing change_seqno
  at this point, i.e. to force the client to re-try the transaction.
  With this change ovsdb_idl_has_ever_connected() remains 'false'
  until the first monitor reply with the actual data received.
  .
  This issue was reported several times during the last couple of weeks.
  .
  Reported-at: https://bugzilla.redhat.com/1968445
  Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383512.html
  Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html
  Fixes: 1c337c43ac1c ("ovsdb-idl: Break into two layers.")
  Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
  Acked-by: Dumitru Ceara <dce...@redhat.com>
  Author: Ilya Maximets <i.maxim...@ovn.org>
  Date: Tue, 8 Jun 2021 15:17:23 +0200
Origin: upstream, https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86.patch
Applied-Upstream: 2.15.1
Bug-Debian: https://bugs.debian.org/1008684
Last-Update: 2023-01-26

---
 lib/ovsdb-idl.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git lib/ovsdb-idl.c lib/ovsdb-idl.c
index 2c8a0c9..6241fb4 100644
--- a/lib/ovsdb-idl.c
+++ b/lib/ovsdb-idl.c
@@ -401,9 +401,15 @@ ovsdb_idl_run(struct ovsdb_idl *idl)
             break;
 
         case OVSDB_CS_EVENT_TYPE_LOCKED:
-            /* If the client couldn't run a transaction because it didn't have
-             * the lock, this will encourage it to try again. */
-            idl->change_seqno++;
+            if (ovsdb_cs_may_send_transaction(idl->cs)) {
+                /* If the client couldn't run a transaction because it didn't
+                 * have the lock, this will encourage it to try again. */
+                idl->change_seqno++;
+            } else {
+                /* We're setting up a session, so don't signal that the
+                 * database changed.  Finalizing the session will increment
+                 * change_seqno anyhow. */
+            }
             break;
 
         case OVSDB_CS_EVENT_TYPE_UPDATE:
-- 
2.30.2

diff -Nru openvswitch-2.15.0+ds1/debian/changelog 
openvswitch-2.15.0+ds1/debian/changelog
--- openvswitch-2.15.0+ds1/debian/changelog     2022-10-03 12:59:27.000000000 
+0200
+++ openvswitch-2.15.0+ds1/debian/changelog     2023-01-26 17:17:41.000000000 
+0100
@@ -1,3 +1,12 @@
+openvswitch (2.15.0+ds1-2+deb11u3) bullseye; urgency=medium
+
+  * Non-maintainer upload.
+  * Apply upstream fix
+    https://github.com/openvswitch/ovs/commit/bc0aa785a83c1
+    (Closes: #1008684)
+
+ -- Michael Prokop <m...@debian.org>  Thu, 26 Jan 2023 17:17:41 +0100
+
 openvswitch (2.15.0+ds1-2+deb11u2) bullseye-security; urgency=medium
 
   * Fix ovs-dpctl-top by removing 3 wrong hunks in py3-compat.patch.
diff -Nru 
openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch
 
openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch
--- 
openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch
  1970-01-01 01:00:00.000000000 +0100
+++ 
openvswitch-2.15.0+ds1/debian/patches/fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch
  2023-01-26 17:17:30.000000000 +0100
@@ -0,0 +1,102 @@
+From: Ilya Maximets <i.maxim...@ovn.org>
+Subject: ovsdb-idl: Fix the database update signaling if it has never been 
connected
+  The symptom of this issue is that OVS bridge looses its IP address on
+  restart.
+  .
+  Simple reproducer:
+   0. start ovsdb-server and ovs-vswitchd
+   1. ovs-vsctl add-br br0
+   2. ifconfig br0 10.0.0.1 up
+   3. ovs-appctl -t ovs-vswitchd exit
+   4. start ovs-vswitchd back.
+  .
+  After step #3 ovs-vswitchd is down, but br0 interface exists and
+  has configured IP address.  After step #4 there is no IP address
+  on the port br0.
+  .
+  What happened:
+  1. ovsdb-cs connects to the database via ovsdb-idl and requests
+     database lock.
+     --> get_schema for _Server database
+     --> lock request
+  .
+  2. ovsdb-cs receives schema for the _Server database.  And sends
+     monitor request.
+     <-- schema for _Server
+     --> monitor_cond for _Server
+  .
+  3. ovsdb-cs receives lock reply.
+     <-- locked
+     At this point ovsdb-cs generates OVSDB_CS_EVENT_TYPE_LOCKED
+     event and passes it to ovsdb-idl.  ovsdb-idl increases change_seqno.
+  .
+  4. ovsdb_idl_has_ever_connected() is 'true' now, because change_seqno
+     is not zero.
+  .
+  5. ovs-vswitchd decides that it has connection with database and
+     all the initial data, therefore initiates configuration of bridges.
+     bridge_run():ovsdb_idl_has_ever_connected() --> true
+  .
+  6. Since monitor request for the Open_vSwitch database is not even
+     sent yet, the database is empty.  This leads to removal of all the
+     ports and all other resources.
+  .
+  7. When data finally received, ovs-vswitchd re-creates bridges and
+     ports, but IP addresses can not be restored.
+  .
+  While splitting out ovsdb-cs from ovsdb-idl one part of the logic
+  was lost.  Particularly, before the split, ovsdb-idl updated
+  change_seqno only in MONITORING state.
+  .
+  Restoring the logic by updating the change_seqno only if may send
+  transaction, i.e. lock is ours and ovsdb-cs is in the MONITORING
+  state.  This matches with the main purpose of increasing change_seqno
+  at this point, i.e. to force the client to re-try the transaction.
+  With this change ovsdb_idl_has_ever_connected() remains 'false'
+  until the first monitor reply with the actual data received.
+  .
+  This issue was reported several times during the last couple of weeks.
+  .
+  Reported-at: https://bugzilla.redhat.com/1968445
+  Reported-at: 
https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/383512.html
+  Reported-at: 
https://mail.openvswitch.org/pipermail/ovs-discuss/2021-June/051222.html
+  Fixes: 1c337c43ac1c ("ovsdb-idl: Break into two layers.")
+  Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
+  Acked-by: Dumitru Ceara <dce...@redhat.com>
+  Author: Ilya Maximets <i.maxim...@ovn.org>
+  Date: Tue, 8 Jun 2021 15:17:23 +0200
+Origin: upstream, 
https://github.com/openvswitch/ovs/commit/bc0aa785a83c11dab482b3e20736b969174d9f86.patch
+Applied-Upstream: 2.15.1
+Bug-Debian: https://bugs.debian.org/1008684
+Last-Update: 2023-01-26
+
+---
+ lib/ovsdb-idl.c | 12 +++++++++---
+ 1 file changed, 9 insertions(+), 3 deletions(-)
+
+diff --git lib/ovsdb-idl.c lib/ovsdb-idl.c
+index 2c8a0c9..6241fb4 100644
+--- a/lib/ovsdb-idl.c
++++ b/lib/ovsdb-idl.c
+@@ -401,9 +401,15 @@ ovsdb_idl_run(struct ovsdb_idl *idl)
+             break;
+ 
+         case OVSDB_CS_EVENT_TYPE_LOCKED:
+-            /* If the client couldn't run a transaction because it didn't have
+-             * the lock, this will encourage it to try again. */
+-            idl->change_seqno++;
++            if (ovsdb_cs_may_send_transaction(idl->cs)) {
++                /* If the client couldn't run a transaction because it didn't
++                 * have the lock, this will encourage it to try again. */
++                idl->change_seqno++;
++            } else {
++                /* We're setting up a session, so don't signal that the
++                 * database changed.  Finalizing the session will increment
++                 * change_seqno anyhow. */
++            }
+             break;
+ 
+         case OVSDB_CS_EVENT_TYPE_UPDATE:
+-- 
+2.30.2
+
diff -Nru openvswitch-2.15.0+ds1/debian/patches/series 
openvswitch-2.15.0+ds1/debian/patches/series
--- openvswitch-2.15.0+ds1/debian/patches/series        2022-10-03 
12:59:27.000000000 +0200
+++ openvswitch-2.15.0+ds1/debian/patches/series        2023-01-26 
17:17:38.000000000 +0100
@@ -3,3 +3,4 @@
 CVE-2021-36980_Fix_use-after-free_while_decoding_RAW_ENCAP.patch
 CVE-2022-4337and8_1_fix_bugs_when_parsing_malformed_LLDP_packets.patch
 CVE-2022-4337and8_2_Add_a_unit_test_for_LLDP.patch
+fix_ovsdb-idl_fix-the-database-update-signaling-if-it-has-never-been-connected.patch

Attachment: signature.asc
Description: PGP signature

Reply via email to