On Mon, Feb 02, 2026 at 07:22:16PM -0800, Pierrick Bouvier wrote:
> On 2/2/26 11:25 AM, Pierrick Bouvier wrote:
> > On 2/2/26 10:52 AM, Stefan Hajnoczi wrote:
> > >
> > > This command-line lets you benchmark virtio-blk without actual I/O
> > > slowing down the request processing:
> > >
> > > qemu-system-x86_64 \
> > > -M accel=kvm \
> > > -cpu host \
> > > -m 4G \
> > > --blockdev file,node-name=drive0,filename=boot.img,cache.direct=on,aio=native \
> > > --blockdev null-co,node-name=drive1,size=$((10 * 1024 * 1024 * 1024)) \
> > > --object iothread,id=iothread0 \
> > > --device virtio-blk-pci,drive=drive0,iothread=iothread0 \
> > > --device virtio-blk-pci,drive=drive1,iothread=iothread0
> > >
> > > Here is a fio command-line for 4 KiB random reads:
> > >
> > > fio \
> > > --ioengine=libaio \
> > > --direct=1 \
> > > --runtime=30 \
> > > --ramp_time=10 \
> > > --rw=randread \
> > > --bs=4k \
> > > --iodepth=128 \
> > > --filename=/dev/vdb \
> > > --name=randread
> > >
> > > This is just a single vCPU, but it should be enough to see if there is
> > > any difference in I/O Operations Per Second (IOPS) or efficiency
> > > (IOPS/CPU utilization).
> > >
> > Thanks very much for the info, Stefan. I didn't even know about the
> > null-co blockdev, so it definitely would have taken me some time to
> > find all this.
> >
> > For what it's worth, I automated the benchmark here (podman is needed
> > for the build), so it can be reused for future changes:
> > https://github.com/pbo-linaro/qemu-linux-stack/tree/x86_64_io_benchmark
> >
> > ./build.sh && ./run.sh path/to/qemu-system-x86_64
> >
> > My initial testing showed a 50% slowdown, which was more than
> > surprising. Profiling showed that the extra time is spent here:
> > https://github.com/qemu/qemu/blob/587f4a1805c83a4e1d59dd43cb14e0a834843d1d/target-info.c#L30
> >
> > When we merged target-info, there had been several versions over a
> > long period, and I was 100% sure we were updating the target_info
> > structure instead of reparsing target_name every time. Unfortunately,
> > that's not the case. I'll fix that.
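
For illustration, a minimal sketch of that kind of caching (the TargetInfo
layout and the parse_target_name() helper below are made up for the example,
not the actual QEMU code):

/*
 * Sketch: cache the value parsed from target_name so hot callers such as
 * virtio_access_is_big_endian() only pay for the string parse once.
 */
#include <string.h>

typedef struct TargetInfo {
    const char *target_name;
    int target_arch;            /* resolved arch id, -1 = not parsed yet */
} TargetInfo;

/* Stand-in for the real name -> arch lookup. */
static int parse_target_name(const char *name)
{
    return strcmp(name, "x86_64") == 0 ? 1 : 0;
}

static TargetInfo target_info_cache = {
    .target_name = "x86_64",    /* placeholder for TARGET_NAME */
    .target_arch = -1,
};

const TargetInfo *target_info(void)
{
    /* Parse lazily, exactly once; later calls are a plain field load. */
    if (target_info_cache.target_arch < 0) {
        target_info_cache.target_arch =
            parse_target_name(target_info_cache.target_name);
    }
    return &target_info_cache;
}
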
> >
> > With that fix, there is no difference in performance (<1%).
> >
> > I'll respin a v4 with the target-info fix, the initial v1 changes, and
> > benchmark results.
> >
> > Thanks for pointing out the performance issue; there was one for sure.
> >
> > Regards,
> > Pierrick
>
> After proper benchmarking, I get these results (in kIOPS) over 20 runs,
> all including the target_arch fix I mentioned.
>
> reference: mean=239.2 var=12.16
> v1: mean=232.2 var=37.06
> v4-wip (optimized virtio_access_is_big_endian): mean=235.05 var=18.64
>
> So basically, we have a ~1.7% performance hit (1 - 235.05/239.2) on a
> torture benchmark.
> Is that acceptable for you, or should we dig deeper?
>
> Regards,
> Pierrick
Could we use some kind of linker trick?
Split just the ring manipulation code out of
virtio.c and keep it target-specific.
Or it could be that making even just these two
target-specific would be enough:
static inline uint16_t virtio_lduw_phys_cached(VirtIODevice *vdev,
                                               MemoryRegionCache *cache,
                                               hwaddr pa)
{
    if (virtio_access_is_big_endian(vdev)) {
        return lduw_be_phys_cached(cache, pa);
    }
    return lduw_le_phys_cached(cache, pa);
}

static inline void virtio_stw_phys_cached(VirtIODevice *vdev,
                                          MemoryRegionCache *cache,
                                          hwaddr pa, uint16_t value)
{
    if (virtio_access_is_big_endian(vdev)) {
        stw_be_phys_cached(cache, pa, value);
    } else {
        stw_le_phys_cached(cache, pa, value);
    }
}
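
For what it's worth, a rough sketch of the "keep it target specific" idea,
assuming a new per-target compilation unit (the file name is made up, this
is not a patch):

/*
 * hw/virtio/virtio-ring-target.c (hypothetical), built once per target,
 * so the endianness test can stay a compile-time TARGET_BIG_ENDIAN check
 * and the hot path avoids any runtime target lookup.  Common virtio.c
 * would call this out-of-line symbol, and each target binary would pick
 * up its own definition at link time.
 */
#include "qemu/osdep.h"
#include "hw/virtio/virtio.h"
#include "hw/virtio/virtio-access.h"

uint16_t virtio_lduw_phys_cached(VirtIODevice *vdev,
                                 MemoryRegionCache *cache,
                                 hwaddr pa)
{
    if (virtio_access_is_big_endian(vdev)) {
        return lduw_be_phys_cached(cache, pa);
    }
    return lduw_le_phys_cached(cache, pa);
}

(virtio_stw_phys_cached would move the same way, and the static inline
definitions above would turn into plain declarations in the header.)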
--
MST