BACKGROUND
-------------

We were also seeing these log lines on some of our web servers at Etsy:
workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND

Kernel version: Linux 6.8.0-1019-gcp #21~22.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

When this happens, we see it correlated with elevated system CPU: https://i.imgur.com/yi6tuor.png

In htop, we see kernel workers using lots of CPU in the function inode_switch_wbs: https://i.imgur.com/ftajwgb.png

Flame graphs from perf recordings confirm the same: https://i.imgur.com/s2nswZR.png

DIGGING IN
-------------

We see this elevated system CPU usage 30 seconds to 5 minutes after writing thousands of small files to disk (part of our web server bootstrapping process). The function inode_switch_wbs_work_fn ( https://github.com/torvalds/linux/blob/906bd684e4b1e517dd424a354744c5b0aebef8af/fs/fs-writeback.c#L490 ) that we believe is causing the elevated system CPU is responsible for moving an inode from one cgroup to another in the context of writeback.

Our understanding is: in the context of cgroups, page cache writeback is tracked at the inode level. A given inode is assigned to whichever cgroup contains the process responsible for the majority of writes to the inode's file. If a new process starts writing heavily to a file, the file's inode may switch to the new process's cgroup. Likewise, if a process managed by systemd is terminated, systemd will remove the process's cgroup, at which point any inodes still assigned to that cgroup are moved to the parent cgroup.
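To make the cgroup side of this concrete, a process's cgroup membership can be inspected directly. This is a minimal sketch; the systemd-run line is illustrative only, and the unit name is made up:

```shell
# Each process's cgroup membership is recorded in /proc/<pid>/cgroup.
# On a cgroup v2 system this prints a single line such as:
#   0::/user.slice/user-1000.slice/session-1.scope
cat /proc/self/cgroup

# A transient oneshot unit gets its own cgroup under system.slice while it
# runs; systemd deletes that cgroup when the unit exits, at which point any
# inodes still attributed to it must switch to the parent cgroup.
# (Illustrative only -- requires root and systemd; unit name is hypothetical.)
#sudo systemd-run --unit=demo-writer --property=Type=oneshot \
#  bash -c 'cat /proc/self/cgroup'
```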
If we have a oneshot systemd service that writes thousands of files to disk, the inodes corresponding to those files are initially assigned to the oneshot service's cgroup: https://i.imgur.com/elGeb51.png

After the oneshot systemd service exits, its cgroup is removed, and the inodes are moved to the parent cgroup: https://i.imgur.com/RuL7e7R.png

MINIMAL REPRODUCTION
---------------------

sudo mkdir -p /var/random-files && sudo systemd-run --property=Type=oneshot bash -c 'dd if=/dev/urandom bs=1024 count=400000 | split -a 16 -b 1k - /var/random-files/file.'

This command creates 400,000 files, each consisting of 1,024 random bytes, with names like /var/random-files/file.aaaaaaaaaaaaaaaa and /var/random-files/file.aaaaaaaaaaaaaaab. The command runs as a transient systemd oneshot service. Within anywhere from 30 seconds to 5 minutes after it finishes, we see a spike in system CPU. Viewing htop confirms this (press shift + k to show kernel threads in htop) -- we see kernel workers using lots of CPU in the function inode_switch_wbs.

DEBUGGING
-------------

Using bpftrace, we can get more debugging information with a tracepoint:

sudo bpftrace -e '
tracepoint:writeback:inode_switch_wbs {
    printf(
        "[%s] inode is switching! inode: %d old cgroup: %d new cgroup: %d\n",
        strftime("%H:%M:%S", nsecs),
        args->ino,
        args->old_cgroup_ino,
        args->new_cgroup_ino
    );
}'

When running our minimal reproduction command from above, we see one log line emitted per file:

[20:49:30] inode is switching! inode: 518312 old cgroup: 7486 new cgroup: 70
[20:49:30] inode is switching! inode: 518311 old cgroup: 7486 new cgroup: 70
...
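The dd | split pipeline in the reproduction can be sanity-checked at a much smaller scale, outside systemd-run. The path and file count below are scaled-down assumptions; this exercises only the file creation, not the cgroup/writeback behavior:

```shell
# Create 100 files of 1 KiB each in a temp directory, mirroring the shape
# of the reproduction command at 1/4000th the size.
dir="$(mktemp -d)"
dd if=/dev/urandom bs=1024 count=100 2>/dev/null \
  | split -a 16 -b 1k - "$dir/file."

# split names the pieces file.aaaaaaaaaaaaaaaa, file.aaaaaaaaaaaaaaab, ...
ls "$dir" | wc -l    # 100
```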
THE FIX
-------------

We found a systemd directive that allows us to turn off certain cgroup accounting features: DisableControllers ( https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html#Control%20Group%20Management ). If either the `io` or `memory` controller is disabled, the kernel will not perform cgroup writeback or any of the related accounting and cgroup switching.

By creating a systemd slice with those controllers disabled, and configuring our oneshot service that writes thousands of files to use that slice, we could no longer reproduce the elevated system CPU. After deploying this fix, we no longer see elevated system CPU, and the "inode_switch_wbs_work_fn hogged CPU" log lines no longer appear in dmesg.

Example slice: https://gist.github.com/dasl-/87b849625846aed17f1e4841b04ecc84#file-dasl-slice-L5
Example oneshot service: https://gist.github.com/dasl-/3e79603a5558911e3521c519e779c3db#file-dasl-service-L7

This investigation was a collaboration between myself and my colleagues: Eric Norris, Tim Korn, Adam Saponara, and Chris Proto.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-oem-6.5 in Ubuntu.
https://bugs.launchpad.net/bugs/2038492

Title:
  workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times,
  consider switching to WQ_UNBOUND

Status in linux-oem-6.5 package in Ubuntu:
  Confirmed

Bug description:
  dmesg has multiple messages:

  [ 852.580542] evict_inodes inode 00000000be24a21a, i_count = 1, was skipped!
  [ 852.580543] evict_inodes inode 000000002aea522e, i_count = 1, was skipped!
  [ 852.580544] evict_inodes inode 0000000048d75a5c, i_count = 1, was skipped!
  [ 852.580545] evict_inodes inode 00000000d8a9e4c2, i_count = 1, was skipped!
  [ 852.580546] evict_inodes inode 000000003d2c905c, i_count = 1, was skipped!
  [ 852.580547] evict_inodes inode 00000000e5b1b232, i_count = 1, was skipped!
  [ 852.580548] evict_inodes inode 0000000097383a6b, i_count = 1, was skipped!
  [ 852.580549] evict_inodes inode 00000000ca8e2b44, i_count = 1, was skipped!
  [ 1751.869281] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
  [ 1781.467278] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
  [ 1806.065364] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
  [ 1901.993975] evict_inodes inode 00000000abaa740e, i_count = 1, was skipped!
  [ 1901.993981] evict_inodes inode 00000000515b9bf8, i_count = 1, was skipped!
  [ 1901.993983] evict_inodes inode 000000001a69d536, i_count = 1, was skipped!
  [ 1901.993984] evict_inodes inode 000000001403f675, i_count = 1, was skipped!
  [ 1901.993985] evict_inodes inode 00000000757de21a, i_count = 1, was skipped!
  [ 1901.993986] evict_inodes inode 000000000cee9028, i_count = 1, was skipped!
  [ 1901.993987] evict_inodes inode 00000000d2827e77, i_count = 1, was skipped!

  Description:  Ubuntu 22.04.3 LTS
  Release:      22.04
  linux-oem-22.04d:
    Installed: 6.5.0.1004.4
    Candidate: 6.5.0.1004.4

  Seems related to #2037214
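For reference, a slice with the controllers disabled, and a oneshot service placed in it, might be sketched as below. Unit names, description text, and the ExecStart path are hypothetical stand-ins; the actual units we deployed are in the gists linked under THE FIX above.

```ini
# /etc/systemd/system/nowriteback.slice  (hypothetical name)
[Unit]
Description=Slice with the io and memory cgroup controllers disabled

[Slice]
DisableControllers=io memory
```

```ini
# /etc/systemd/system/bulk-file-writer.service  (hypothetical name)
[Unit]
Description=Oneshot service that writes many small files

[Service]
Type=oneshot
Slice=nowriteback.slice
# Hypothetical script standing in for the real bootstrapping step.
ExecStart=/usr/local/bin/write-many-files.sh
```

Placing the service in the slice via Slice= means the io and memory controllers are never enabled for its cgroup, so the kernel has no per-cgroup writeback state to switch when the unit's cgroup is removed.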