On 2/3/2026 3:12 PM, Stefan Hajnoczi wrote:
On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
We evaluated the patches on an s390x host with a single guest using 16
virtio block devices backed by FCP multipath devices in a separate-disk
setup, with the I/O scheduler set to 'none' in both host and guest.

The fio workload included sequential and random read/write with varying
numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
with single and dual iothreads, using the newly introduced poll-weight
parameter to measure their impact on CPU cost and throughput.

Compared to the baseline, across four FIO workload patterns (sequential
R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
for two iothreads), while CPU usage on the s390x host dropped
significantly (-10% to -25% and -7% to -12%, respectively).

Hi Jaehoon,
I would like to run the same fio benchmarks on a local NVMe drive (<10us
request latency) to see how that type of hardware configuration is
affected. Are the scripts and fio job files available somewhere?

Thanks,
Stefan

Thank you for your reply.
The fio scripts are not available in a location you can access, but there is nothing particularly special about the settings. I'm sharing below the methodology and test setup used by our performance team.

Guest Setup
----------------------
- 12 vCPUs, 4 GiB memory
- 16 virtio disks backed by the FCP multipath devices on the host

FIO test parameters
-----------------------
- FIO Version: fio-3.33
- Filesize: 2G
- Blocksize: 8K / 128K
- Direct I/O: 1
- FIO I/O Engine: libaio
- NUMJOB List: 1, 4, 8, 16
- IODEPTH: 8
- Runtime (s): 150

Two FIO samples for random read
--------------------------------
fio --direct=1 --name=test --numjobs=16 \
    --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 \
    --size=32G --time_based --runtime=4m --readwrite=randread --ioengine=libaio \
    --iodepth=8 --bs=8k

fio --direct=1 --name=test --numjobs=4 \
    --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0 \
    --size=8G --time_based --runtime=4m --readwrite=randread --ioengine=libaio \
    --iodepth=8 --bs=8k


Additional notes
----------------
- Each file is placed on a separate disk device mounted under subw<n>, as specified in --filename=....
- We execute one warmup run, then two measurement runs, and calculate the average.

Hi Jaehoon,
I ran fio benchmarks on an Intel Optane SSD DC P4800X Series drive (<10
microsecond latency). This is with just 1 drive.

The 8 KiB block size results show something similar to what you
reported: there are IOPS (or throughput) regressions and CPU utilization
improvements.

Although the CPU improvements are welcome, I think the default behavior
should only be changed if the IOPS regressions can be brought below 5%.

The regressions seem to happen regardless of whether 1 or 2 IOThreads
are configured. CPU utilization is different (98% vs 78%) depending on
the number of IOThreads, so the regressions happen across a range of CPU
utilizations.

The 128 KiB block size results are not interesting because the drive
already saturates at numjobs=1. This is expected since the drive cannot
go much above ~2 GiB/s throughput.

You can find the Ansible playbook, libvirt domain XML, fio
command-lines, and the fio/sar data here:

https://gitlab.com/stefanha/virt-playbooks/-/tree/aio-polling-efficiency

Please let me know if you'd like me to rerun the benchmark with new
patches or a configuration change.

Do you want to have a video call to discuss your work and how to get the
patches merged?

Host
----
CPU: Intel Xeon Silver 4214 CPU @ 2.20GHz
RAM: 32 GiB

Guest
-----
vCPUs: 8
RAM: 4 GiB
Disk: 1 virtio-blk aio=native cache=none

IOPS
----
rw        bs   numjobs iothreads iops   diff
randread  8k   1       1         163417 -7.8%
randread  8k   1       2         165041 -2.4%
randread  8k   4       1         221508 -0.64%
randread  8k   4       2         251298 0.008%
randread  8k   8       1         222128 -0.51%
randread  8k   8       2         249489 -2.6%
randread  8k   16      1         230535 -0.18%
randread  8k   16      2         246732 -0.22%
randread  128k 1       1          17616 -0.11%
randread  128k 1       2          17678 0.027%
randread  128k 4       1          17536 -0.27%
randread  128k 4       2          17610 -0.031%
randread  128k 8       1          17369 -0.42%
randread  128k 8       2          17433 -0.071%
randread  128k 16      1          17215 -0.61%
randread  128k 16      2          17269 -0.22%
randwrite 8k   1       1         156597 -3.1%
randwrite 8k   1       2         157720 -3.8%
randwrite 8k   4       1         218448 -0.5%
randwrite 8k   4       2         247075 -5.1%
randwrite 8k   8       1         220866 -0.75%
randwrite 8k   8       2         260935 -0.011%
randwrite 8k   16      1         230913 0.23%
randwrite 8k   16      2         261125 -0.01%
randwrite 128k 1       1          16009 0.094%
randwrite 128k 1       2          16070 0.035%
randwrite 128k 4       1          16073 -0.62%
randwrite 128k 4       2          16131 0.059%
randwrite 128k 8       1          16106 0.092%
randwrite 128k 8       2          16153 0.048%
randwrite 128k 16      1          16102 -0.0091%
randwrite 128k 16      2          16160 0.048%

IOThread CPU usage
------------------
iothreads before (%)  after (%)
1         98.7        95.81
2         78.43       66.13

Stefan

Hello Stefan,

Thank you very much for your effort in running these benchmarks.
The results show a pattern very similar to what our performance team
observed.

I fully agree with the 5% threshold for the default behavior.
However, we need an approach that balances the current performance-oriented
polling scheme with CPU efficiency.

I found that relying on the existing grow/shrink parameters alone was too
limited to achieve these results. That is why I moved to a weight-based
grow/shrink approach, which keeps the polling window robust against jitter.
Specifically, it avoids abrupt resets to zero: when device latency exceeds
the threshold, the window shrinks gradually instead of being reset
immediately.
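
To illustrate the idea, here is a rough sketch (not the actual patch code;
the structure and the names PollState, poll_weight and adjust_poll_window
are illustrative only). The point is the last branch: a reset-style policy
would set the window to zero there, while the weighted policy only decays it.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t poll_ns;      /* current busy-poll window (ns) */
    int64_t poll_max_ns;  /* threshold / upper bound on the window (ns) */
    double  poll_weight;  /* hypothetical: 0.0 behaves like an immediate
                             reset, values closer to 1.0 shrink gradually */
} PollState;

static void adjust_poll_window(PollState *ps, int64_t block_ns)
{
    if (block_ns <= ps->poll_ns) {
        /* Polling covered the wait: grow toward the maximum. */
        int64_t grown = ps->poll_ns ? ps->poll_ns * 2 : 4000;
        ps->poll_ns = grown < ps->poll_max_ns ? grown : ps->poll_max_ns;
    } else if (block_ns > ps->poll_max_ns) {
        /*
         * Latency exceeded the threshold.  A reset-style policy would do
         * ps->poll_ns = 0 here; the weighted policy only decays the window,
         * so a single jitter spike does not wipe out the polling state.
         */
        ps->poll_ns = (int64_t)(ps->poll_ns * ps->poll_weight);
    }
    /* Latencies between poll_ns and poll_max_ns leave the window unchanged
     * in this sketch. */
}

int main(void)
{
    PollState ps = { .poll_ns = 16000, .poll_max_ns = 32000, .poll_weight = 0.5 };

    adjust_poll_window(&ps, 100000);   /* one slow request */
    printf("after jitter spike: %ld ns (instead of 0)\n", (long)ps.poll_ns);

    adjust_poll_window(&ps, 2000);     /* fast request: window grows again */
    printf("after fast request: %ld ns\n", (long)ps.poll_ns);
    return 0;
}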

As seen in both your results and our team's measurements, this may involve a
bit of a performance trade-off, but it provides a reasonable balance for
CPU-sensitive environments.

Thank you for suggesting the video call; I am also looking forward to
hearing your thoughts. I'm on US Central Time. Except for Tuesday, I can
adjust my schedule to a time that works for you.

Please let me know your preferred time.

Regards,
Jaehoon Kim

