Public bug reported:

# Our problem

We are running multiple K8S clusters on Ubuntu 24.04.1 LTS nodes.

On one of these clusters, we have noticed at least twice that most of the nodes
(~5 out of 8) went offline without any action on our side.
To restore connectivity, we tried ifdown/ifup, disconnecting and reconnecting
the network from the hypervisor, and restarting the networking service, but
nothing helped: we had to reboot the nodes from the console.

After some investigation, we were able to correlate this outage with the
`apt-daily-upgrade` service run triggered by the `apt-daily-upgrade` timer.
Somehow, the `apt-daily-upgrade` service upgraded a package, which triggered a
`systemctl daemon-reexec`, cutting network connectivity in the process.
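
One way to narrow down which package upgrade coincided with the re-exec is to look for the APT transaction that was running at that moment. The following is only a minimal sketch: it assumes the upgrade run is recorded in `/var/log/apt/history.log` in the usual `Start-Date` / `End-Date` block format, and the `SAMPLE` text, package names and the year are hypothetical placeholders, not data from the affected nodes.

```python
from datetime import datetime


def parse_apt_history(text):
    """Split apt history.log text into a list of dicts, one per transaction."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                entry[key] = value.strip()
        if "Start-Date" in entry:
            entries.append(entry)
    return entries


def _parse_date(value):
    # apt separates date and time with two spaces; split() tolerates either.
    date, time = value.split()
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


def transactions_around(entries, moment):
    """Return the transactions whose start/end window contains `moment`."""
    hits = []
    for entry in entries:
        start = _parse_date(entry["Start-Date"])
        end = _parse_date(entry.get("End-Date", entry["Start-Date"]))
        if start <= moment <= end:
            hits.append(entry)
    return hits


# Hypothetical excerpt, for illustration only (not taken from the nodes).
SAMPLE = """\
Start-Date: 2025-02-21  06:06:40
Commandline: /usr/bin/unattended-upgrade
Upgrade: example-package:amd64 (1.0-1, 1.0-2)
End-Date: 2025-02-21  06:07:10

Start-Date: 2025-02-21  10:00:00
Commandline: apt install example-tool
Install: example-tool:amd64 (2.0-1)
End-Date: 2025-02-21  10:00:05
"""
```

Calling `transactions_around(parse_apt_history(text), datetime(2025, 2, 21, 6, 6, 55))` on the real log contents would list the transactions active at the time of the re-exec seen in the journal. The `Upgrade:` field of each hit names the packages involved.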

# Symptoms

- The node is flagged as `NotReady` by K8s.
- SSH connections to the node fail.
- From the node, we can't ping the gateway.
- The `journalctl` output for the `systemctl daemon-reexec` is far more verbose
  than usual:

```
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting requested from client PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: systemd 255.4-1ubuntu8.5 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected virtualization vmware.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected architecture x86-64.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting man-db.service - Daily man-db regeneration...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping containerd.service - containerd container runtime...
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: ERR: ntpd exiting on signal 15 (Terminated)
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: PROTO: 172.16.10.254 unlink local addr 172.16.34.4 -> <null>
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping ntpsec.service - Network Time Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping open-vm-tools.service - Service for virtual machines hosted on VMware...
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Journal stopped
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Received SIGTERM from PID 1 (systemd).
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-journald.service - Journal Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Deactivated successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped ntpsec.service - Network Time Service.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Consumed 1min 12.819s CPU time, 12.4M memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Deactivated successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3374 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3375 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3475 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3512 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3545 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 3618 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit process 2574706 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped containerd.service - containerd container runtime.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Consumed 9min 54.298s CPU time, 3.4G memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3374 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3375 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3475 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3512 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3545 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 3618 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found left-over process 2574706 (containerd-shim) in control group while starting unit. Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting containerd.service - containerd container runtime...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: netplan-ovs-cleanup.service - OpenVSwitch configuration for cleanup was skipped because of an unmet condition check (ConditionFileIsExecutable=/usr/bin/ovs-vsctl).
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting ntpsec.service - Network Time Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: systemd-networkd-wait-online.service: Deactivated successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped systemd-networkd-wait-online.service - Wait for Network to be Configured.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-networkd-wait-online.service - Wait for Network to be Configured...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping systemd-networkd.service - Network Configuration...
```

The `Found left-over process` lines made me think of bug
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2013543 but, from
my understanding, we should not be impacted on Noble hosts.
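
To spot the abnormal re-execs across several nodes without reading each journal by hand, the log can be scanned for `Stopping …service` messages that follow a `Reexecuting requested` line. This is a rough sketch under stated assumptions: it expects one journal entry per line (unwrapped `journalctl` output), the function name is made up for illustration, and it simply attributes every stop message to the most recent re-exec without checking timestamps. The sample lines are taken from the excerpts in this report, with the timestamp and hostname prefix trimmed.

```python
import re

# Matches the line systemd logs when a client asks PID 1 to re-execute.
REEXEC = re.compile(
    r"systemd\[1\]: Reexecuting requested from client PID (\d+) .*\(unit ([^)]+)\)"
)
# Matches the "Stopping <unit>.service" job messages emitted by PID 1.
STOPPING = re.compile(r"systemd\[1\]: Stopping (\S+\.service)")


def units_stopped_after_reexec(lines):
    """Return [(requesting_unit, [stopped units]), ...], one item per re-exec."""
    results = []
    current = None
    for line in lines:
        match = REEXEC.search(line)
        if match:
            current = (match.group(2), [])
            results.append(current)
            continue
        if current is not None:
            stop = STOPPING.search(line)
            if stop and stop.group(1) not in current[1]:
                current[1].append(stop.group(1))
    return results


# Lines from this report, timestamp and hostname prefix trimmed for brevity.
SAMPLE = [
    "systemd[1]: Reexecuting requested from client PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...",
    "systemd[1]: Reexecuting.",
    "systemd[1]: Stopping containerd.service - containerd container runtime...",
    "systemd[1]: Stopping ntpsec.service - Network Time Service...",
    "systemd[1]: Stopping systemd-networkd.service - Network Configuration...",
    "systemd[1]: Reexecuting requested from client PID 23296 ('systemctl') (unit session-2.scope)...",
    "systemd[1]: Reexecuting.",
    "systemd[1]: Detected architecture x86-64.",
]

for unit, stopped in units_stopped_after_reexec(SAMPLE):
    print(unit, "->", stopped or "no services stopped")
```

A re-exec with a non-empty list, such as the one requested from `apt-daily-upgrade.service` here, is a candidate for the failure mode described above. The manual one from `session-2.scope` yields an empty list.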

# Testcase

Here is the catch: we can't reproduce the issue on demand.

When manually running `systemctl daemon-reexec`, we do not experience
the same outage, and journalctl logs only 5 lines:

```
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Reexecuting requested from client PID 23296 ('systemctl') (unit session-2.scope)...
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Reexecuting.
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: systemd 255.4-1ubuntu8.5 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT >
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Detected virtualization vmware.
févr. 21 11:01:06 lylux0634kdp004 systemd[1]: Detected architecture x86-64.
```

# Some additional details

root@lylux0634kdp004:~# lsb_release -d
No LSB modules are available.
Description:    Ubuntu 24.04.1 LTS
root@lylux0634kdp004:~# apt-cache policy systemd
systemd:
  Installé : 255.4-1ubuntu8.5
  Candidat : 255.4-1ubuntu8.5
 Table de version :
 *** 255.4-1ubuntu8.5 500
        500 https://XXXXXX/ubuntu-fr noble-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     255.4-1ubuntu8 500
        500 https://XXXXX/ubuntu-fr noble/main amd64 Packages
root@lylux0634kdp004:~# uname -a
Linux lylux0634kdp004 6.8.0-52-generic #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Feel free to request any additional details that would help in
troubleshooting this issue.

Antoine

** Affects: systemd (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to systemd in Ubuntu.
https://bugs.launchpad.net/bugs/2099676

Title:
  Network connectivity loss after systemctl daemon-reexec

Status in systemd package in Ubuntu:
  New
