BACKGROUND
-------------
We were also seeing these log lines on some of our web servers at Etsy:

workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times,
consider switching to WQ_UNBOUND

Kernel version: Linux 6.8.0-1019-gcp #21~22.04.1-Ubuntu SMP x86_64
x86_64 x86_64 GNU/Linux

When this happens, we see it correlated with elevated system CPU:
https://i.imgur.com/yi6tuor.png

In htop, we see that kernel workers are using lots of CPU in the
function inode_switch_wbs: https://i.imgur.com/ftajwgb.png

Flame graphs from perf recordings confirm the same:
https://i.imgur.com/s2nswZR.png


DIGGING IN
-------------
We see this elevated system CPU usage 30 seconds to 5 minutes after writing 
thousands of small files to disk (part of our web server bootstrapping 
process). The function inode_switch_wbs_work_fn
(https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L490),
which we believe is causing the elevated system CPU, is responsible for
moving an inode from one cgroup to another in the context of writeback.

Our understanding is: in the context of cgroups, page cache writeback is
tracked at the inode level. A given inode is assigned to whichever
cgroup contains the process that is responsible for the majority of
writes to the inode’s file. If a new process starts writing a lot to a
file, the file’s inode may switch to the new process’s cgroup. Likewise,
if a process managed by systemd is terminated, systemd will remove the
process’s cgroup, at which point any inodes assigned to the process’s
cgroup will be moved to the parent cgroup.
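
Per that understanding, on a cgroup v2 system you can see which cgroup
a process (and thus, initially, the inodes it dirties) belongs to by
reading /proc (the output shown here is illustrative):

cat /proc/$$/cgroup
# 0::/user.slice/user-1000.slice/session-4.scope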

If we have a oneshot systemd service that writes thousands of files to
disk, initially the inodes corresponding to those files are assigned to
the oneshot service's cgroup: https://i.imgur.com/elGeb51.png

After the oneshot systemd service exits, its cgroup is removed, and
inodes are moved to the parent cgroup: https://i.imgur.com/RuL7e7R.png
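
One way to observe this lifecycle is to watch for the transient unit's
cgroup directory while the service runs (a sketch assuming the cgroup
v2 unified hierarchy; systemd-run places its transient run-*.service
units under system.slice):

watch -n 1 'ls -d /sys/fs/cgroup/system.slice/run-*.service 2>/dev/null'

The directory appears when the service starts and disappears shortly
after the oneshot exits, which is when the inodes begin switching to
the parent cgroup.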


MINIMAL REPRODUCTION
---------------------
sudo mkdir -p /var/random-files && sudo systemd-run --property=Type=oneshot \
  bash -c 'dd if=/dev/urandom bs=1024 count=400000 | split -a 16 -b 1k - /var/random-files/file.'

This command creates 400,000 files, each consisting of 1,024 random
bytes. The files have names like /var/random-files/file.aaaaaaaaaaaaaaaa
and /var/random-files/file.aaaaaaaaaaaaaaab. This command is run as a
systemd oneshot service. Within anywhere from 30 seconds to 5 minutes
after this command finishes, we see a spike in system CPU. htop
confirms this (press Shift+K to show kernel threads): kernel workers
use lots of CPU in the function inode_switch_wbs.
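
To catch the kernel's "hogged CPU" warning as it is emitted, we can
also follow the kernel log in a second terminal while the reproduction
runs:

sudo dmesg -w | grep --line-buffered inode_switch_wbs_work_fn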


DEBUGGING
-------------
Using bpftrace, we can get more debugging information with a tracepoint:

sudo bpftrace -e '
tracepoint:writeback:inode_switch_wbs {
  printf(
        "[%s] inode is switching! inode: %d old cgroup: %d new cgroup: %d\n",
        strftime("%H:%M:%S", nsecs),
        args->ino,
        args->old_cgroup_ino,
        args->new_cgroup_ino
  );
}'


When running our minimal reproduction command from above, we will see one log 
line emitted per file:

[20:49:30] inode is switching! inode: 518312 old cgroup: 7486 new cgroup: 70
[20:49:30] inode is switching! inode: 518311 old cgroup: 7486 new cgroup: 70
...
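
With one line per inode, this gets noisy at 400,000 files. A variant
that only counts switch events per second makes the burst easier to
quantify (same tracepoint, aggregated at a one-second interval):

sudo bpftrace -e '
tracepoint:writeback:inode_switch_wbs { @switches = count(); }
interval:s:1 { print(@switches); clear(@switches); }'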


THE FIX
-------------
We found a systemd directive that allows us to turn off certain cgroup
accounting features: DisableControllers
(https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#Control%20Group%20Management).

If either the `io` or the `memory` controller is disabled, the kernel will
not perform cgroup writeback or any of the related accounting and cgroup
switching. We found that by creating a systemd slice with those
controllers disabled and configuring our systemd oneshot service that
writes thousands of files to use that slice, we could no longer
reproduce the elevated system CPU. After deploying this fix, we no
longer see elevated system CPU and we no longer have these
"inode_switch_wbs_work_fn hogged CPU" log lines in dmesg.

Example slice:
https://gist.github.com/dasl-/87b849625846aed17f1e4841b04ecc84#file-dasl-slice-L5

Example oneshot service:
https://gist.github.com/dasl-/3e79603a5558911e3521c519e779c3db#file-dasl-service-L7
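
The shape of that configuration is roughly the following (a sketch;
the unit names here are illustrative, and the gists above contain the
versions we actually deployed):

# /etc/systemd/system/no-writeback.slice
[Slice]
DisableControllers=io memory

# In the oneshot service, assign it to that slice:
[Service]
Type=oneshot
Slice=no-writeback.slice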

This investigation was a collaboration between me and my colleagues
Eric Norris, Tim Korn, Adam Saponara, and Chris Proto.

https://bugs.launchpad.net/bugs/2038492

Title:
  workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times,
  consider switching to WQ_UNBOUND

Status in linux-oem-6.5 package in Ubuntu:
  Confirmed

Bug description:
  dmesg has multiple messages:
  [  852.580542] evict_inodes inode 00000000be24a21a, i_count = 1, was skipped!
  [  852.580543] evict_inodes inode 000000002aea522e, i_count = 1, was skipped!
  [  852.580544] evict_inodes inode 0000000048d75a5c, i_count = 1, was skipped!
  [  852.580545] evict_inodes inode 00000000d8a9e4c2, i_count = 1, was skipped!
  [  852.580546] evict_inodes inode 000000003d2c905c, i_count = 1, was skipped!
  [  852.580547] evict_inodes inode 00000000e5b1b232, i_count = 1, was skipped!
  [  852.580548] evict_inodes inode 0000000097383a6b, i_count = 1, was skipped!
  [  852.580549] evict_inodes inode 00000000ca8e2b44, i_count = 1, was skipped!
  [ 1751.869281] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
  [ 1781.467278] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
  [ 1806.065364] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
  [ 1901.993975] evict_inodes inode 00000000abaa740e, i_count = 1, was skipped!
  [ 1901.993981] evict_inodes inode 00000000515b9bf8, i_count = 1, was skipped!
  [ 1901.993983] evict_inodes inode 000000001a69d536, i_count = 1, was skipped!
  [ 1901.993984] evict_inodes inode 000000001403f675, i_count = 1, was skipped!
  [ 1901.993985] evict_inodes inode 00000000757de21a, i_count = 1, was skipped!
  [ 1901.993986] evict_inodes inode 000000000cee9028, i_count = 1, was skipped!
  [ 1901.993987] evict_inodes inode 00000000d2827e77, i_count = 1, was skipped!

  Description:    Ubuntu 22.04.3 LTS
  Release:        22.04

  linux-oem-22.04d:
    Installed: 6.5.0.1004.4
    Candidate: 6.5.0.1004.4


  Seems related to #2037214
