On Tue, Feb 03, 2026 at 09:31:04AM -0800, Pierrick Bouvier wrote:
> On 2/3/26 3:07 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 02, 2026 at 07:22:16PM -0800, Pierrick Bouvier wrote:
> > > On 2/2/26 11:25 AM, Pierrick Bouvier wrote:
> > > > On 2/2/26 10:52 AM, Stefan Hajnoczi wrote:
> > > > > 
> > > > > This command-line lets you benchmark virtio-blk without actual I/O
> > > > > slowing down the request processing:
> > > > > 
> > > > >      qemu-system-x86_64 \
> > > > >          -M accel=kvm \
> > > > >          -cpu host \
> > > > >          -m 4G \
> > > > >          --blockdev file,node-name=drive0,filename=boot.img,cache.direct=on,aio=native \
> > > > >          --blockdev null-co,node-name=drive1,size=$((10 * 1024 * 1024 * 1024)) \
> > > > >          --object iothread,id=iothread0 \
> > > > >          --device virtio-blk-pci,drive=drive0,iothread=iothread0 \
> > > > >          --device virtio-blk-pci,drive=drive1,iothread=iothread0
> > > > > 
> > > > > Here is a fio command-line for 4 KiB random reads:
> > > > > 
> > > > >      fio \
> > > > >          --ioengine=libaio \
> > > > >          --direct=1 \
> > > > >          --runtime=30 \
> > > > >          --ramp_time=10 \
> > > > >          --rw=randread \
> > > > >          --bs=4k \
> > > > >          --iodepth=128 \
> > > > >          --filename=/dev/vdb \
> > > > >          --name=randread
> > > > > 
> > > > > This is just a single vCPU, but it should be enough to see if there is
> > > > > any difference in I/O Operations Per Second (IOPS) or efficiency
> > > > > (IOPS/CPU utilization).
> > > > > 
> > > > Thanks very much for the info, Stefan. I didn't even know about the
> > > > null-co blockdev, so it definitely would have taken some time to find
> > > > all this.
> > > > 
> > > > For what it's worth, I automated the benchmark here (podman is needed
> > > > for the build), so it can be reused for future changes:
> > > > https://github.com/pbo-linaro/qemu-linux-stack/tree/x86_64_io_benchmark
> > > > 
> > > > ./build.sh && ./run.sh path/to/qemu-system-x86_64
> > > > 
> > > > My initial testing showed a 50% slowdown, which was more than
> > > > surprising. After profiling, the extra time turned out to be spent here:
> > > > https://github.com/qemu/qemu/blob/587f4a1805c83a4e1d59dd43cb14e0a834843d1d/target-info.c#L30
> > > > 
> > > > target-info went through several versions over a long period before it
> > > > was merged, and I was 100% sure we were updating the target_info
> > > > structure instead of reparsing the target_name every time.
> > > > Unfortunately, that's not the case. I'll fix that.
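> > > > 
> > > > For illustration, the fix amounts to the usual parse-once-and-cache
> > > > pattern. This is only a sketch: parse_target_name() and the cached
> > > > pointer are made up for the example, not the actual target-info API,
> > > > and it assumes the selected target never changes at runtime:
> > > > 
> > > >     /* Sketch: resolve the target name a single time instead of on
> > > >      * every call on the hot path. */
> > > >     static const TargetInfo *get_target_info(void)
> > > >     {
> > > >         static const TargetInfo *cached;
> > > > 
> > > >         if (!cached) {
> > > >             cached = parse_target_name(target_name());
> > > >         }
> > > >         return cached;
> > > >     }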
> > > > 
> > > > With that fix, there is no difference in performance (<1%).
> > > > 
> > > > I'll respin a v4 with the target info fix, initial v1 changes and
> > > > benchmark results.
> > > > 
> > > > Thanks for pointing out the performance issue; there was indeed one.
> > > > 
> > > > Regards,
> > > > Pierrick
> > > 
> > > After proper benchmarking, I get these results (in kIOPS) over 20 runs,
> > > all including the target-info fix I mentioned.
> > > 
> > > reference: mean=239.2 var=12.16
> > > v1: mean=232.2 var=37.06
> > > v4-wip (optimized virtio_access_is_big_endian): mean=235.05 var=18.64
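> > > 
> > > (Comparing reference against v4-wip: (239.2 - 235.05) / 239.2 ≈ 1.7%.)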
> > > 
> > > So basically, we have a ~1.7% performance hit on a torture benchmark.
> > > Is that acceptable to you, or should we dig deeper?
> > > 
> > > Regards,
> > > Pierrick
> > 
> > Could we use some kind of linker trick?
> > Split just the ring manipulation code from
> > virtio.c and keep it target specific.
> > 
> 
> While that would allow us to extract the common part, unfortunately, as long
> as we keep a target-specific part, we still have a problem to solve.
> 
> Indeed, some symbols (maybe the ring manipulation code, or just the functions
> mentioned below) will still be duplicated, and in the end, when we link two
> targets needing different versions, we'll have a conflict.

Shrug. The generic, slower one can be made stronger and used then.
This is not a new or unsolvable problem.
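
One possible shape for this, as a rough sketch only (the file split, the
function name and the weak/strong symbol trick are illustrative, not an
actual patch):

    /* virtio-ring-<target>.c -- built per target: weak, compile-time-fast
     * definition, used when only that target is linked in. */
    bool __attribute__((weak))
    virtio_ring_is_big_endian(VirtIODevice *vdev)
    {
        return false;               /* e.g. a little-endian-only target */
    }

    /* virtio-ring-generic.c -- only linked into the single binary: strong
     * definition, so it overrides the per-target weak ones and the slower
     * run-time check is only paid there. */
    bool virtio_ring_is_big_endian(VirtIODevice *vdev)
    {
        return virtio_access_is_big_endian(vdev);
    }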


> Since our goal for a single binary is to have at least arm and another base
> architecture, we are already in this situation.
> 
> So the only solutions left are:
> - accept the 1.7% regression, and we can provide more "real world"
>   benchmarks to show that it gets absorbed in all the existing layers
> - compute virtio_access_is_big_endian once and reuse the cached value
>   everywhere: that's what Philippe implemented in his patches (see the
>   sketch below).
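> 
> As a sketch only (the cached field name access_is_big_endian is made up
> here; the real patches may store it differently):
> 
>     /* Cache the answer wherever the device endianness is (re)computed,
>      * e.g. at reset or feature negotiation... */
>     vdev->access_is_big_endian = virtio_access_is_big_endian(vdev);
> 
>     /* ...so the hot-path helpers only test a bool instead of going
>      * through target info on every vring access: */
>     static inline uint16_t virtio_lduw_phys_cached(VirtIODevice *vdev,
>                                                    MemoryRegionCache *cache,
>                                                    hwaddr pa)
>     {
>         if (vdev->access_is_big_endian) {
>             return lduw_be_phys_cached(cache, pa);
>         }
>         return lduw_le_phys_cached(cache, pa);
>     }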
> 
> What would you prefer?


Let's find a way where people who are not interested in a single binary
do not have to pay the price, please.

> > 
> > Or it could be that even just these two would be enough:
> > 
> > static inline uint16_t virtio_lduw_phys_cached(VirtIODevice *vdev,
> >                                                 MemoryRegionCache *cache,
> >                                                 hwaddr pa)
> > {
> >      if (virtio_access_is_big_endian(vdev)) {
> >          return lduw_be_phys_cached(cache, pa);
> >      }
> >      return lduw_le_phys_cached(cache, pa);
> > }
> > 
> > static inline void virtio_stw_phys_cached(VirtIODevice *vdev,
> >                                            MemoryRegionCache *cache,
> >                                            hwaddr pa, uint16_t value)
> > {
> >      if (virtio_access_is_big_endian(vdev)) {
> >          stw_be_phys_cached(cache, pa, value);
> >      } else {
> >          stw_le_phys_cached(cache, pa, value);
> >      }
> > }
> > 
> 
> Regards,
> Pierrick

