On Thu, May 13, 2021 at 04:38:18PM -0400, Nicholas D Steeves wrote:
> > > > On 17-04-2021 00:35, Nicholas D Steeves wrote:
> >> Last I checked, only btrfs supports the cgroupv2 I/O controller; this
> >> should probably be documented. Alternatively, if more than btrfs (ie:
> >> XFS and ext4) supports it, but not other file systems (ie: f2fs,
> >> reiser4, jfs, etc) than this should be documented.
There are two things which are being confused here. One is whether you are using cgroup v1 versus cgroup v2, and the other is which I/O-related cgroup controller you are using. There is the I/O cost model based controller (CONFIG_BLK_CGROUP_IOCOST) and the I/O latency controller (CONFIG_BLK_CGROUP_IOLATENCY). These are two *different* I/O controllers that are supported by cgroup v2. The I/O cost controller is simpler, and is similar to the cgroup v1 block I/O controller, although it's more complex/featureful than the v1 controller. The I/O latency controller is experimental (from block/Kconfig: "Note, this is an experimental interface and could be changed someday."), and since it was implemented by Josef Bacik, who is one of the maintainers of btrfs, it was only tested on btrfs (and Facebook workloads).
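To make the distinction concrete, here's roughly what turning on the cgroup v2 I/O controller and setting an I/O latency target looks like. Treat this as a sketch only: it assumes cgroup v2 is mounted at /sys/fs/cgroup, and the "test" cgroup name, the 8:0 device number, and the target value are made-up placeholders; check Documentation/admin-guide/cgroup-v2.rst for the exact syntax and units before relying on any of it.

    # Confirm the two controllers are compiled in (the config file
    # location varies by distribution)
    grep -E "BLK_CGROUP_(IOCOST|IOLATENCY)" /boot/config-$(uname -r)

    # Check which cgroup v2 controllers are available, then enable the
    # io controller for children of the root cgroup
    cat /sys/fs/cgroup/cgroup.controllers
    echo "+io" > /sys/fs/cgroup/cgroup.subtree_control

    # Create a cgroup, give it a per-device latency target, and move a
    # shell into it so the workload's I/O is charged to that cgroup
    mkdir /sys/fs/cgroup/test
    echo "8:0 target=10" > /sys/fs/cgroup/test/io.latency
    echo $$ > /sys/fs/cgroup/test/cgroup.procs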
Now, there is nothing specifically file-system-related in any of the I/O controllers, whether we're talking about the cgroup v1 blkio controller, the cgroup v2 I/O cost controller, or the cgroup v2 I/O latency controller, but it turns out I/O controllers are *subtle*. How they interact with memory cgroups and file systems ends up producing all sorts of "interesting" interactions. Solving this problem in a general fashion takes a lot of software engineering investment, and I know of only two companies that have done that dedicated work: (a) Google, where we made the cgroup v1 block I/O controller work with ext4 in no-journal mode; to make things work well at high levels of memory, CPU, and I/O load with fine-grained control, we needed to make changes to the cfq I/O scheduler that were rejected by upstream as "too complicated" and "insufficiently general", so we ended up forking cfq to create an internal gfq I/O scheduler that we've been carrying for the last 8 years or so, and (b) Facebook, where Josef basically created his own I/O controller, but since it was new and experimental, he only got it working for Facebook's configuration, got it upstream, and hasn't done much with it since 2018.

I'm not saying this to diss Josef; it's a hard problem, and speaking from experience, solving it for a particular company's workload can easily take 1-3 focused SWE-years --- and the business case to make that investment more generally hasn't really existed.

So why is it that the I/O latency controller has been tested to work well only on btrfs? Two reasons. The first is that the file system has to mark its metadata I/O using the REQ_META flag, so that the metadata doesn't get throttled. Why is this important? Because metadata I/O often happens while holding locks, or happens on kernel threads located in a different (or the system) cgroup, on behalf of processes in a different cgroup. So the I/O latency controller exempts from throttling any I/O requests that are marked with REQ_META or REQ_SWAP (for swap requests). Btrfs and ext4 mark metadata I/O with REQ_META[1]; XFS does not.

[1] REQ_META was useful even before the I/O latency controller was introduced, since it allows you to easily identify metadata I/O using block tracing tools, or to collect timing information for metadata I/O using eBPF, etc.

The other issue is that ext4 will do synchronous (blocking) data writebacks as part of the commit processing, in order to make sure freshly allocated blocks won't accidentally reveal previously deleted data after a crash or power failure. So if the I/O latency controller throttles I/Os happening inside the ext4 commit thread, it will slow down the commit, and this slows down *all* threads.

And if we exempt writebacks from the commit thread, a large enough percentage of writebacks can end up getting exempted that the I/O latency controller doesn't work very well at all. There is nothing *stopping* you from using the I/O latency controller with ext4. However, it may not work well, and you may be very unhappy with the results.

Actually, I suspect that if you mounted an ext4 file system without a journal ("mkfs.ext4 -O ^has_journal", which is how Google uses ext4 in our data center servers) or with data=writeback, the I/O latency controller would work *fine*. However, it's not been tested, and so if it breaks, you get to keep both pieces. Of course, running ext4 without a journal means the file system can get corrupted after a crash. Given how Google uses ext4 in our data centers, where it's the back-end store for a cluster file system using erasure coding (e.g., Reed-Solomon encoding) or n=3 replication, the speed benefits are worth it, and we have other ways of protecting against single node failures --- whether due to fs corruption after a power failure, the network router at the top of the rack failing, or a power distribution unit failure taking out an entire row of racks of servers. And data=writeback also has tradeoffs: "A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash." - https://www.kernel.org/doc/html/latest/admin-guide/ext4.html#data-mode
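For what it's worth, those two configurations look roughly like this, where /dev/sdX and /mnt are placeholders and, as I said above, neither has actually been validated with the I/O latency controller:

    # No-journal ext4, the way we run it at Google; there is no journal
    # to replay after a crash, so you depend on fsck and on redundancy
    # at a higher layer
    mkfs.ext4 -O ^has_journal /dev/sdX
    dumpe2fs -h /dev/sdX | grep -i features    # has_journal should be absent

    # Alternatively, keep the journal but mount with data=writeback, so
    # data blocks are not forced out as part of the journal commit
    # (with the crash-recovery tradeoff quoted above)
    mount -t ext4 -o data=writeback /dev/sdX /mnt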
> I'm also not sure, because everywhere I've looked appears to document
> that this technology is not filesystem specific (except the Facebook
> cgroupv2 iogroup announcement which asserts btrfs-only).
>
> CCing Theodore Ts'o, who will definitely know if ext4 is supported!

As described above, under *some* circumstances, ext4 *may* work with the I/O latency controller. But until you test it, you won't know for sure. More generally, that's the problem with I/O controllers: they are fundamentally oblivious to how they interact with other parts of the kernel, and until you do the testing, and perform any remediation you find is necessary, you may end up being surprised when you put them into production use. Josef did some brief testing with XFS and ext4, noted that the controller didn't work in their default configurations, and did whatever work was needed to make it work well for Facebook using btrfs.

I'll give you another example that we learned the hard way. Depending on how tight you make your memory cgroups and how tightly you constrain your I/O controller, you can run into problems with write throttling --- where processes which are dirtying memory faster than it can be written back are put to sleep instead of triggering the OOM killer. It turns out that write throttling when total system memory is low is quite different from write throttling when a particular memory cgroup is low on free memory, and so the complex interactions between the memory cgroup controller and the I/O cgroup controller are another reason why there appears to be a guaranteed employment act for data center kernel engineers. :-)

So even with btrfs, how you configure your system may be quite different from how Facebook configures its data center servers, so I'd encourage you to be careful and do a lot of careful testing before you deploy the I/O latency controller in production. It's not that the block I/O latency controller won't work with any particular file system --- the problem is that it may work too well, or at least not the way you expect, a la the magic broomstick carrying water in Disney's Fantasia[2].

[2] https://video.disney.com/watch/sorcerer-s-apprentice-fantasia-4ea9ebc01a74ea59a5867853

Cheers,

					- Ted

P.S. As far as other file systems go (f2fs, jfs, reiserfs, etc.), the same issues apply; if they don't mark metadata I/O with the REQ_META flag, things may not go well. If they do, the next question is whether they are doing a lot of I/O on kernel threads on behalf of various userspace processes. But until you actually test and verify, I would hesitate to make any kind of guarantee.
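P.P.S. If you want a quick way to check whether a particular file system is setting REQ_META, the block tracing approach from footnote [1] is the easiest. Roughly, with the device name as a placeholder (blktrace needs root and debugfs mounted):

    # Trace the device while running a metadata-heavy workload; in the
    # blkparse output, requests carrying REQ_META show an "M" in the
    # RWBS column (e.g. "RM", "WM", "WSM")
    blktrace -d /dev/sdX -o - | blkparse -i -

If a metadata-heavy workload (say, a large rm -rf, or find over a cold cache) shows essentially no "M" requests, that's a strong hint the file system isn't marking its metadata, and the I/O latency controller's metadata exemption won't do it any good.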