Hi,
qemu-img with -t writeback -T writeback does not behave well under the
Linux cgroup v1 memory controller: when too much page cache is dirtied,
the process is killed instead of being throttled the way it would be
outside cgroups or under cgroup v2:
https://bugzilla.redhat.com/show_bug.cgi?id=2196072

I wanted to share my thoughts on this issue.

cache=none bypasses the host page cache entirely, so the I/O never
counts against the cgroup memory limit. Where it is available, it is the
easiest way to avoid the cgroup v1 problem.

However, not all Linux file systems support O_DIRECT, and qemu-img's I/O
pattern may perform worse under cache=none than under cache=writeback.
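
For reference, here is roughly what cache=none translates to at the
syscall level (a standalone sketch for illustration, not qemu-img code):
the file is opened with O_DIRECT, and the buffer address, file offset,
and request length all have to be suitably aligned.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t len = 1024 * 1024;
    void *buf;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    /* O_DIRECT bypasses the page cache, so nothing is charged to the
     * cgroup memory controller for this I/O. */
    fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    /* Buffer, offset, and length must be aligned; 4096 covers the
     * logical block size of most devices. */
    if (posix_memalign(&buf, 4096, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, len);

    if (pwrite(fd, buf, len, 0) != (ssize_t)len) {
        perror("pwrite");
        return 1;
    }

    free(buf);
    close(fd);
    return 0;
}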

1. Which file systems support O_DIRECT in Linux 6.5?

I searched the Linux source code for file systems that implement
.direct_IO or set FMODE_CAN_ODIRECT. This is not exhaustive and may not
be 100% accurate.

The big name file systems (ext4, XFS, btrfs, nfs, smb, ceph) support
O_DIRECT. The most obvious omission is tmpfs.

If your users are running file systems that support O_DIRECT, then
qemu-img -t none -T none is an easy solution to the cgroup v1 memory
limit issue.

Supported:
9p
affs
btrfs
ceph
erofs
exfat
ext2
ext4
f2fs
fat
fuse
gfs2
hfs
hfsplus
jfs
minix
nfs
nilfs2
ntfs3
ocfs2
orangefs
overlayfs
reiserfs
smb
udf
xfs
zonefs

Unsupported:
adfs
befs
bfs
cramfs
ecryptfs
efs
freevxfs
hpfs
hugetlbfs
isofs
jffs2
ntfs
omfs
qnx4
qnx6
ramfs
romfs
squashfs
sysv
tmpfs
ubifs
ufs
vboxsf
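
One way to double-check the lists above on a particular mount point is
to simply try opening a file there with O_DIRECT: on file systems
without support, the open(2) call fails with EINVAL. A small standalone
checker (not part of qemu, just for illustration) could look like this:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 2;
    }

    /* open(2) reports EINVAL when the file system cannot do O_DIRECT
     * (tmpfs is the common example). */
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        printf("no O_DIRECT support here (%s)\n", strerror(errno));
        return 1;
    }

    printf("O_DIRECT works on this file system\n");
    close(fd);
    return 0;
}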

2. Is qemu-img performance with O_DIRECT acceptable?

The I/O pattern matters more with O_DIRECT because every I/O request is
sent to the storage device. Buffer sizes therefore have a bigger effect
(many small I/Os carry more per-request overhead than a few large ones),
and concurrency helps keep the device saturated.
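
As an illustration only (a hypothetical copy loop with made-up names,
not how qemu-img is actually structured), the block size chosen here
directly controls how many requests reach the device once O_DIRECT is in
use:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy src to dst with O_DIRECT using a fixed block size. Assumes the
 * file size is a multiple of block_size so every request stays aligned. */
static int copy_direct(const char *src, const char *dst, size_t block_size)
{
    int in = open(src, O_RDONLY | O_DIRECT);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    void *buf = NULL;
    ssize_t n = 0;

    if (in < 0 || out < 0 || posix_memalign(&buf, 4096, block_size) != 0) {
        return -1;
    }

    /* Each iteration is one read plus one write sent straight to the
     * device: a 64 KiB block size issues 32 times more requests than a
     * 2 MiB block size for the same amount of data. */
    while ((n = read(in, buf, block_size)) > 0) {
        if (write(out, buf, n) != n) {
            n = -1;
            break;
        }
    }

    free(buf);
    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}

Keeping several requests in flight at once is the other lever; qemu-img
convert already copies with multiple coroutines in flight, so some of
this is tuning rather than new machinery.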

If you switch to O_DIRECT and encounter performance problems, then
qemu-img can be optimized to send I/O patterns with less overhead. This
requires performance analysis.

3. What about buffered I/O, since O_DIRECT is not universally supported?

If you can't use O_DIRECT, then qemu-img could be extended to manage its
dirty page cache footprint carefully: pick a dirty-byte budget and write
data back to disk whenever the budget is exhausted. Richard Jones has
shared links covering posix_fadvise(2) and sync_file_range(2):
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
https://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
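
To make that concrete, here is a rough sketch of the budget idea (names
like dirty_account and DIRTY_BUDGET are made up for illustration, writes
are assumed sequential, and a real patch would have to fit qemu's
coroutine model): write with buffered I/O as usual, but once the budget
is used up, flush the dirty window and drop it from the page cache.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define DIRTY_BUDGET (16 * 1024 * 1024) /* example value, needs tuning */

struct dirty_window {
    off_t start;        /* file offset where the unflushed region begins */
    off_t dirty_bytes;  /* bytes written but not yet flushed */
};

/* Call after each buffered write of 'len' bytes (sequential writes
 * assumed). Once the budget is exhausted, flush the window and drop it
 * from the page cache so the dirty set never grows past DIRTY_BUDGET. */
static void dirty_account(int fd, struct dirty_window *w, size_t len)
{
    w->dirty_bytes += len;
    if (w->dirty_bytes < DIRTY_BUDGET) {
        return;
    }

    /* Write the dirty window out to disk and wait for completion... */
    sync_file_range(fd, w->start, w->dirty_bytes,
                    SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);

    /* ...then tell the kernel the now-clean pages can be reclaimed.
     * POSIX_FADV_DONTNEED only drops clean pages, which is why the flush
     * comes first. */
    posix_fadvise(fd, w->start, w->dirty_bytes, POSIX_FADV_DONTNEED);

    w->start += w->dirty_bytes;
    w->dirty_bytes = 0;
}

A fancier version would kick off writeback asynchronously on the current
window and only wait on the previous one, so copying and writeback
overlap instead of stalling at every budget boundary.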

We can discuss qemu-img code changes and performance analysis more if
you decide to take that direction.

Hope this helps!

Stefan
