On Tue, Feb 03, 2026 at 09:31:04AM -0800, Pierrick Bouvier wrote:
> On 2/3/26 3:07 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 02, 2026 at 07:22:16PM -0800, Pierrick Bouvier wrote:
> > > On 2/2/26 11:25 AM, Pierrick Bouvier wrote:
> > > > On 2/2/26 10:52 AM, Stefan Hajnoczi wrote:
> > > > >
> > > > > This command-line lets you benchmark virtio-blk without actual I/O
> > > > > slowing down the request processing:
> > > > >
> > > > > qemu-system-x86_64 \
> > > > > -M accel=kvm \
> > > > > -cpu host \
> > > > > -m 4G \
> > > > > --blockdev file,node-name=drive0,filename=boot.img,cache.direct=on,aio=native \
> > > > > --blockdev null-co,node-name=drive1,size=$((10 * 1024 * 1024 * 1024)) \
> > > > > --object iothread,id=iothread0 \
> > > > > --device virtio-blk-pci,drive=drive0,iothread=iothread0 \
> > > > > --device virtio-blk-pci,drive=drive1,iothread=iothread0
> > > > >
> > > > > Here is a fio command-line for 4 KiB random reads:
> > > > >
> > > > > fio \
> > > > > --ioengine=libaio \
> > > > > --direct=1 \
> > > > > --runtime=30 \
> > > > > --ramp_time=10 \
> > > > > --rw=randread \
> > > > > --bs=4k \
> > > > > --iodepth=128 \
> > > > > --filename=/dev/vdb \
> > > > > --name=randread
> > > > >
> > > > > This is just a single vCPU, but it should be enough to see if there is
> > > > > any difference in I/O Operations Per Second (IOPS) or efficiency
> > > > > (IOPS/CPU utilization).
> > > > >
> > > > Thanks very much for the info, Stefan. I didn't even know about the
> > > > null-co blockdev, so it definitely would have taken me some time to
> > > > find all this.
> > > >
> > > > For what it's worth, I automated the benchmark here (podman is needed
> > > > for the build), so it can be reused for future changes:
> > > > https://github.com/pbo-linaro/qemu-linux-stack/tree/x86_64_io_benchmark
> > > >
> > > > ./build.sh && ./run.sh path/to/qemu-system-x86_64
> > > >
> > > > My initial testing showed a 50% slowdown, which was more than
> > > > surprising. After profiling, it turns out the extra time is spent here:
> > > > https://github.com/qemu/qemu/blob/587f4a1805c83a4e1d59dd43cb14e0a834843d1d/target-info.c#L30
> > > >
> > > > When we merged target-info, there were several versions over a long
> > > > time, and I was 100% sure we were filling in the target_info structure
> > > > once, instead of reparsing the target_name on every call.
> > > > Unfortunately, that's not the case. I'll fix that.
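> > > >
> > > > For illustration, the shape of the fix is roughly the following (a
> > > > sketch only; the exact function, enum and lookup names may differ
> > > > from the final patch):
> > > >
> > > >     /* Parse the target name once and cache the result, instead of
> > > >      * re-running qapi_enum_parse() on every call. */
> > > >     SysEmuTarget target_arch(void)
> > > >     {
> > > >         static SysEmuTarget arch = SYS_EMU_TARGET__MAX;
> > > >
> > > >         if (arch == SYS_EMU_TARGET__MAX) {
> > > >             arch = qapi_enum_parse(&SysEmuTarget_lookup, target_name(),
> > > >                                    SYS_EMU_TARGET__MAX, &error_abort);
> > > >         }
> > > >         return arch;
> > > >     }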
> > > >
> > > > With that fix, there is no difference in performance (<1%).
> > > >
> > > > I'll respin a v4 with the target-info fix, the initial v1 changes, and
> > > > the benchmark results.
> > > >
> > > > Thanks for pointing out the performance issue; there was one for sure.
> > > >
> > > > Regards,
> > > > Pierrick
> > >
> > > After proper benchmarking, I get the following results (in kIOPS) over
> > > 20 runs, all including the target_arch fix I mentioned.
> > >
> > > reference: mean=239.2 var=12.16
> > > v1: mean=232.2 var=37.06
> > > v4-wip (optimized virtio_access_is_big_endian): mean=235.05 var=18.64
> > >
> > > So basically, v4-wip is a 1.7% performance hit versus the reference
> > > ((239.2 - 235.05) / 239.2) on a torture benchmark.
> > > Is that acceptable to you, or should we dig more?
> > >
> > > Regards,
> > > Pierrick
> >
> > Could we use some kind of linker trick?
> > Split just the ring manipulation code out of
> > virtio.c and keep it target-specific.
> >
>
> While it would allow us to extract the common part, unfortunately, as long
> as we keep a target-specific part, we still have a problem to solve.
>
> Indeed, some symbols (maybe the ring manipulation code, or just the
> functions mentioned below) will still be duplicated, and in the end, when
> we link two targets needing different versions, we'll have a symbol
> conflict.
Shrug. The slower generic one can be made stronger and used then.
This is not a new or unsolvable problem.
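
One way this could look at link time (an illustrative sketch, not actual
QEMU code; the helper name and the file split are made up):

    /* ring-common.c: generic, slower implementation that works for any
     * target; weak, so a single-target build can override it. */
    bool __attribute__((weak)) virtio_ring_is_big_endian(VirtIODevice *vdev)
    {
        return virtio_is_big_endian(vdev);  /* runtime check */
    }

    /* ring-target.c: compiled only into single-target binaries, where
     * most of the answer is known at build time; this strong definition
     * wins over the weak one there. */
    bool virtio_ring_is_big_endian(VirtIODevice *vdev)
    {
        /* Simplified: ignores bi-endian targets for brevity. */
        return TARGET_BIG_ENDIAN &&
               !virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1);
    }
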
> Since our goal for a single binary is to include at least arm and another
> base architecture, we are already in this situation.
>
> So the only solutions left are:
> - accept the 1.7% regression; we can provide more "real world" benchmarks
> to show that it gets absorbed by all the existing layers
> - compute virtio_access_is_big_endian once and reuse the cached value
> everywhere: that's what Philippe implemented in his patches (rough shape
> sketched below).
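>
> To give an idea of that second option (a sketch only, not Philippe's actual
> patch; the cached field and the helper names are made up):
>
>     /* Recomputed at the few places where the device endianness can
>      * change (e.g. reset, feature negotiation), off the hot path. */
>     static void virtio_update_access_endianness(VirtIODevice *vdev)
>     {
>         vdev->access_is_big_endian = virtio_access_is_big_endian_slow(vdev);
>     }
>
>     /* Hot path: the ring accessors just read the cached value. */
>     static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
>     {
>         return vdev->access_is_big_endian;
>     }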
>
> What would you prefer?
Let's find a way where people who are not interested in a single binary
do not have to pay the price, please.
> >
> > Or it could be that even just these two would be enough:
> >
> > static inline uint16_t virtio_lduw_phys_cached(VirtIODevice *vdev,
> >                                                MemoryRegionCache *cache,
> >                                                hwaddr pa)
> > {
> >     if (virtio_access_is_big_endian(vdev)) {
> >         return lduw_be_phys_cached(cache, pa);
> >     }
> >     return lduw_le_phys_cached(cache, pa);
> > }
> >
> > static inline void virtio_stw_phys_cached(VirtIODevice *vdev,
> >                                           MemoryRegionCache *cache,
> >                                           hwaddr pa, uint16_t value)
> > {
> >     if (virtio_access_is_big_endian(vdev)) {
> >         stw_be_phys_cached(cache, pa, value);
> >     } else {
> >         stw_le_phys_cached(cache, pa, value);
> >     }
> > }
> >
>
> Regards,
> Pierrick