On Tue, Jun 21, 2016 at 1:25 PM, Christian Borntraeger <[email protected]> wrote:
> On 06/21/2016 02:13 PM, Stefan Hajnoczi wrote:
>> v4:
>>  * Rebased onto qemu.git/master
>>  * Included latest performance results
>>
>> v3:
>>  * Drop Patch 1 to batch guest notify for non-dataplane
>>
>>    The Linux AIO completion BH and the virtio-blk batch notify BH changed
>>    order in the AioContext->first_bh list as a side-effect of moving the BH
>>    from hw/block/dataplane/virtio-blk.c to hw/block/virtio-blk.c.  This
>>    caused a serious performance regression for both dataplane and
>>    non-dataplane.
>>
>>    I've decided not to move the BH in this series and work on a separate
>>    solution for making batch notify generic.
>>
>>    The remaining patches have been reordered and cleaned up.
>>
>>  * See performance data below.
>>
>> v2:
>>  * Simplify s->rq live migration [Paolo]
>>  * Use more efficient bitmap ops for batch notification [Paolo]
>>  * Fix perf regression due to batch notify BH in wrong AioContext [Christian]
>>
>> The virtio_blk guest driver has supported multiple virtqueues since Linux
>> 3.17.  This patch series adds multiple virtqueues to QEMU's virtio-blk
>> emulated device.
>>
>> Ming Lei sent patches previously but these were not merged.  This series
>> implements virtio-blk multiqueue for QEMU from scratch since the codebase
>> has changed.  Live migration support for s->rq was also missing from the
>> previous series and has been added.
>>
>> It's important to note that QEMU's block layer does not support multiqueue
>> yet.  Therefore virtio-blk device processes all virtqueues in the same
>> AioContext (IOThread).  Further work is necessary to take advantage of
>> multiqueue support in QEMU's block layer once it becomes available.
>>
>> Performance results:
>>
>> Using virtio-blk-pci,num-queues=4 can produce a speed-up but -smp 4
>> introduces a lot of variance across runs.  No pinning was performed.
>>
>> RHEL 7.2 guest on RHEL 7.2 host with 1 vcpu and 1 GB RAM unless otherwise
>> noted.  The default configuration of the Linux null_blk driver is used as
>> /dev/vdb.
>>
>> $ cat files/fio.job
>> [global]
>> filename=/dev/vdb
>> ioengine=libaio
>> direct=1
>> runtime=60
>> ramp_time=5
>> gtod_reduce=1
>>
>> [job1]
>> numjobs=4
>> iodepth=16
>> rw=randread
>> bs=4K
>>
>> $ ./analyze.py runs/
>> Name                         IOPS        Error
>> v4-smp-4-dataplane           13326598.0 ± 6.31%
>> v4-smp-4-dataplane-no-mq     11483568.0 ± 3.42%
>> v4-smp-4-no-dataplane        18108611.6 ± 1.53%
>> v4-smp-4-no-dataplane-no-mq  13951225.6 ± 7.81%
>
> This differs from the previous numbers.  What is with
> and what is without patch?  I am surprised to see dataplane
> to be slower than no-dataplane - this contradicts everything
> that I have seen in the past.
I reran without the patch, just qemu.git/master:

unpatched-7e13ea57f-smp-4-dataplane     11564565.4 ± 3.08%
unpatched-7e13ea57f-smp-4-no-dataplane  14262888.8 ± 2.82%

The host is Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (16 logical CPUs)
with 32 GB RAM.

So the trend is the same without the patch.  Therefore I'm "satisfied"
that the mq vs no-mq numbers show an advantage for multiqueue.

They also show that this patch series does not introduce a regression:
v4-smp-4-dataplane-no-mq is close to unpatched-7e13ea57f-smp-4-dataplane
(11483568.0 ± 3.42% vs 11564565.4 ± 3.08%) and
v4-smp-4-no-dataplane-no-mq is close to
unpatched-7e13ea57f-smp-4-no-dataplane (13951225.6 ± 7.81% vs
14262888.8 ± 2.82%).

Stefan
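[Editor's note: the analyze.py script itself is not part of this thread, so the
sketch below is an assumption about what it does, inferred only from its output
format ("IOPS" mean ± "Error" percent across repeated runs). The run names and
sample values are made up for illustration.]

```python
# Hedged sketch of the mean +/- relative-error aggregation that the
# "./analyze.py runs/" output above appears to perform.  The relative
# error is taken here as the sample standard deviation over the mean,
# in percent -- an assumption, not confirmed by the thread.
import statistics


def summarize(name, iops_samples):
    """Return one table row: name, mean IOPS, and relative error in %."""
    mean = statistics.mean(iops_samples)
    err = statistics.stdev(iops_samples) / mean * 100
    return "%-30s %.1f ± %.2f%%" % (name, mean, err)


if __name__ == "__main__":
    # Hypothetical per-run IOPS samples for one benchmark configuration.
    runs = {
        "v4-smp-4-dataplane": [12.5e6, 13.3e6, 14.2e6],
    }
    print("%-30s %-11s %s" % ("Name", "IOPS", "Error"))
    for name, samples in runs.items():
        print(summarize(name, samples))
```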
