On 2/3/26 11:06 AM, Michael S. Tsirkin wrote:
On Tue, Feb 03, 2026 at 09:31:04AM -0800, Pierrick Bouvier wrote:
On 2/3/26 3:07 AM, Michael S. Tsirkin wrote:
On Mon, Feb 02, 2026 at 07:22:16PM -0800, Pierrick Bouvier wrote:
On 2/2/26 11:25 AM, Pierrick Bouvier wrote:
On 2/2/26 10:52 AM, Stefan Hajnoczi wrote:
This command-line lets you benchmark virtio-blk without actual I/O
slowing down the request processing:
qemu-system-x86_64 \
-M accel=kvm \
-cpu host \
-m 4G \
--blockdev file,node-name=drive0,filename=boot.img,cache.direct=on,aio=native \
--blockdev null-co,node-name=drive1,size=$((10 * 1024 * 1024 * 1024)) \
--object iothread,id=iothread0 \
--device virtio-blk-pci,drive=drive0,iothread=iothread0 \
--device virtio-blk-pci,drive=drive1,iothread=iothread0
Here is a fio command-line for 4 KiB random reads:
fio \
--ioengine=libaio \
--direct=1 \
--runtime=30 \
--ramp_time=10 \
--rw=randread \
--bs=4k \
--iodepth=128 \
--filename=/dev/vdb \
--name=randread
This is just a single vCPU, but it should be enough to see if there is
any difference in I/O Operations Per Second (IOPS) or efficiency
(IOPS/CPU utilization).
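(As an illustrative example with made-up numbers: 240 kIOPS at 150% host CPU is a better result than 240 kIOPS at 170%, even though the raw IOPS are identical.)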
Thanks very much for the info, Stefan. I didn't even know about the null-co blockdev, so it definitely would have taken me some time to find all this.
For what it's worth, I automated the benchmark here (podman is needed for the build), so it can be reused for future changes:
https://github.com/pbo-linaro/qemu-linux-stack/tree/x86_64_io_benchmark
./build.sh && ./run.sh path/to/qemu-system-x86_64
My initial testing showed a 50% slowdown, which was more than
surprising. After profiling, the extra time is spent here:
https://github.com/qemu/qemu/blob/587f4a1805c83a4e1d59dd43cb14e0a834843d1d/target-info.c#L30
When we merged target-info, there were several versions over a long
period, and I was 100% sure we had ended up updating the target_info
structure instead of reparsing the target_name on every call.
Unfortunately, that's not the case. I'll fix that.
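For the record, a minimal sketch of the caching pattern I have in mind; this is not the actual target-info.c code, and the table/helper names are made up:

#include <stddef.h>
#include <string.h>

/* Hypothetical per-target description and lookup table. */
typedef struct TargetInfo {
    const char *name;
    /* ... other per-target properties ... */
} TargetInfo;

extern const TargetInfo target_table[];
extern const size_t target_table_len;
extern const char *target_name(void);

const TargetInfo *target_info(void)
{
    /* Resolve the target once and reuse the cached pointer on the hot
     * path, instead of re-parsing the name on every call. */
    static const TargetInfo *cached;

    if (!cached) {
        const char *name = target_name();
        for (size_t i = 0; i < target_table_len; i++) {
            if (!strcmp(target_table[i].name, name)) {
                cached = &target_table[i];
                break;
            }
        }
    }
    return cached;
}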
With that fix, there is no difference in performance (<1%).
I'll respin a v4 with the target info fix, initial v1 changes and
benchmark results.
Thanks for pointing out the performance issue; there was one for sure.
Regards,
Pierrick
After proper benchmarking, I get these results (in kIOPS) over 20 runs, all
including the target_arch fix I mentioned.
reference: mean=239.2 var=12.16
v1: mean=232.2 var=37.06
v4-wip (optimized virtio_access_is_big_endian): mean=235.05 var=18.64
So basically, comparing the reference against v4-wip ((239.2 - 235.05) / 239.2), we have a 1.7% performance hit on a torture benchmark.
Is that acceptable for you, or should we dig more?
Regards,
Pierrick
Could we use some kind of linker trick?
Split just the ring manipulation code from
virtio.c and keep it target specific.
While it would allow us to extract the common part, unfortunately, as long as
we keep a target-specific part, we still have a problem to solve.
Indeed, some symbols (maybe the ring manipulation code, or just the functions
mentioned below) will still be duplicated, and in the end, when we link
two targets needing different versions, we'll have a symbol conflict.
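To make that concrete, here is a toy illustration (hypothetical code, not QEMU's): if the ring accessors stay target specific, each per-target object compiles its own copy of the same symbol.

#include <stdint.h>

/* ring_le.c -- built for a little-endian guest */
uint16_t vring_avail_idx(const uint16_t *p)
{
    return *p;                    /* load as-is */
}

/* ring_be.c -- built for a big-endian guest */
uint16_t vring_avail_idx(const uint16_t *p)
{
    return __builtin_bswap16(*p); /* byte-swap on a little-endian host */
}

Linking both objects into one binary then fails with a "multiple definition" error, so the duplicated symbols would still need per-target renaming or a runtime dispatch.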
Shrug. The generic slower one can be made stronger and be used then.
This is not a new or unsolvable problem.
Since our goal for a single binary is to have at least arm plus another base
architecture, we are in this situation now.
So the only solutions left are:
- accept the 1.7% regression, and we can provide more "real world"
benchmarks to show that it gets absorbed in all existing layers
- compute virtio_access_is_big_endian once and reuse this everywhere: that's
what Philippe implemented in his patches.
What would you prefer?
Let's find a way where people who are not interested in a single binary
do not have to pay the price, please.
Sure, I'll try and benchmark with Philippe's additional patches; it should
remove the overhead completely.
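For reference, a rough sketch of what the "compute once" approach could look like; the field and helper names here are hypothetical, and Philippe's actual patches may structure this differently:

#include <stdbool.h>

typedef struct VirtIODevice VirtIODevice;

struct VirtIODevice {
    /* ... existing fields ... */
    bool access_is_big_endian;   /* cached, refreshed when it can change */
};

/* The existing per-call logic, moved out of the hot path
 * (hypothetical name). */
bool virtio_compute_access_is_big_endian(VirtIODevice *vdev);

/* Called whenever device endianness can change (reset, feature
 * negotiation, ...). */
static void virtio_refresh_access_endianness(VirtIODevice *vdev)
{
    vdev->access_is_big_endian = virtio_compute_access_is_big_endian(vdev);
}

/* The hot-path accessor becomes a plain field read, with no per-access
 * target lookup. */
static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
{
    return vdev->access_is_big_endian;
}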
Thanks,
Pierrick