Public bug reported:

We run multiple Ubuntu servers that carry the full Internet routing
table in their kernel route table. We recently started upgrading them
from 22.04 LTS to 26.04 LTS and observed multiple instabilities caused
by resource exhaustion triggered by various system processes; this did
not happen on 22.04.

## Cause

We have two observations that explain the exhaustion. In both cases the
kernel route table is enumerated, apparently unintentionally, to detect
Internet reachability, where route enumeration is not appropriate:

1. landscape-common's /etc/update-motd.d/50-landscape-sysinfo calls a
python3-netifaces function that enumerates the full kernel route table.

   - This causes update-motd to stall, which effectively prevents shell
login and can even exhaust the CPU on a server.

2. fwupd and unattended-upgrades call NetworkMonitor.get_default, which
is expensive, to determine Internet reachability.

   - Periodic updates (fwupd-update.timer, unattended-upgrades) cause
excessive CPU/RAM usage.

## Root cause


1. landscape-common uses python3-netifaces, which enumerates all route table
entries to detect the default network interface. The call path is reached from
/etc/update-motd.d/50-landscape-sysinfo

   - https://github.com/canonical/landscape-client/blob/9cfa2458f1a2ef6b28fe4f7740031df5410a4f9b/landscape/lib/network.py#L127
   - https://salsa.debian.org/python-team/packages/netifaces/-/blob/ffd1f927a289e2bc2defa19f637a6d0e31cf57b8/netifaces.c#L1778

2. fwupd and unattended-upgrades call glib/gio's
NetworkMonitor.get_default, which *immediately* pulls all routes from
the kernel and subscribes to updates. Doing this frequently performs
poorly and risks excessive CPU/RAM usage (or it exhausts resources and
never completes).

   - https://github.com/mvo5/unattended-upgrades/blob/26ae30dd42ee30ab4cc2e50f9d794f1fa8730f2e/unattended-upgrade#L907
   - https://github.com/fwupd/fwupd/pull/8275
   - https://gitlab.gnome.org/GNOME/glib/-/blob/7a314ecee2663d50dd776672a43e58d398b7dd50/gio/gnetworkmonitornetlink.c#L177
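
The cost of this call can be checked on an affected host with a small
timing sketch like the one below. It assumes PyGObject (`gi`) is
installed; `time_get_default` is a hypothetical helper name, not part of
any of the packages above. Since the default monitor is shared within a
process, only the first call is expected to pay for the initial route
dump.

```python
import time


def time_get_default():
    """Time the first Gio.NetworkMonitor.get_default() call in this process.

    Returns (elapsed_seconds, network_available), or None if PyGObject/Gio
    is not installed. Hypothetical helper for reproduction only.
    """
    try:
        import gi
        gi.require_version("Gio", "2.0")
        from gi.repository import Gio
    except (ImportError, ValueError):
        return None

    start = time.monotonic()
    # The first call constructs the default monitor; per the report, the
    # netlink backend dumps the kernel route table at this point.
    monitor = Gio.NetworkMonitor.get_default()
    available = monitor.get_network_available()
    return time.monotonic() - start, available


if __name__ == "__main__":
    print(time_get_default())
```

Comparing the elapsed time on a host with a full table against one with
only a default route should show the difference, though the exact
numbers will depend on the system.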


## Possible fix

- Don't enumerate the kernel route table (at the least, it is not appropriate
in packages installed out of the box).
  - Avoid NetworkMonitor.get_default and python3-netifaces (which is
unmaintained).
  - If there is a known destination address, make a netlink call to let the
kernel resolve the route (equivalent to `ip route get 1.1.1.1`).
- Have a reasonable timeout, such as 200 ms.
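
As an illustration of the single-destination lookup, here is a minimal
Python sketch. It does not speak netlink itself: the first function
swaps in a connected UDP socket, which makes the kernel do one route
lookup without sending any packet, and the second shells out to
`ip route get` with a 200 ms timeout. The function names, the port
number 53 and the default timeout value are assumptions for
illustration, not code from any of the affected packages.

```python
import socket
import subprocess


def source_address_for(destination, port=53):
    """Return the local source IP the kernel would pick to reach
    `destination`, or None if the kernel reports no usable route."""
    family = socket.AF_INET6 if ":" in destination else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            # connect() on a UDP socket transmits nothing; it only asks the
            # kernel to resolve a route and source address for this peer.
            sock.connect((destination, port))
        except OSError:
            return None
        return sock.getsockname()[0]


def ip_route_get(destination, timeout=0.2):
    """Run `ip route get <destination>` with a hard timeout.

    Returns the command output, or None on timeout, missing `ip` binary,
    or a non-zero exit status (for example, network unreachable)."""
    try:
        result = subprocess.run(
            ["ip", "route", "get", destination],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(source_address_for("1.1.1.1"))
    print(ip_route_get("1.1.1.1"))
```

Both functions ask about one destination only, so their cost should not
grow with the number of routes being enumerated in user space, which is
the property the list above is asking for.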

## Misc

- The NetworkMonitor library is confusing for its consumers, and end users get
surprised because developers can't tell that get_default() could be expensive.
I am guessing this is by design: NetworkMonitor wants to *monitor*, so it
pulls all routes and subscribes to event updates using netlink.
  - Maybe there is no problem if NetworkManager is running instead of
systemd-networkd, and that was the case for most previous NetworkMonitor usage?

** Affects: landscape-client
     Importance: Undecided
         Status: New

** Affects: fwupd (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: unattended-upgrades (Ubuntu)
     Importance: Undecided
         Status: New

** Also affects: fwupd (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: unattended-upgrades (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2157046

Title:
  CPU/RAM exhaustion triggered by system process on 26.04 with large
  kernel route table

To manage notifications about this bug go to:
https://bugs.launchpad.net/landscape-client/+bug/2157046/+subscriptions

