On Wed, Nov 21, 2012 at 01:22:22PM +0800, Asias He wrote:
> On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 20, 2012 at 10:02 AM, Asias He <[email protected]> wrote:
> >> Hello Stefan,
> >>
> >> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> >>> This series adds the -device virtio-blk-pci,x-data-plane=on property
> >>> that enables a high-performance I/O codepath. A dedicated thread is
> >>> used to process virtio-blk requests outside the global mutex and
> >>> without going through the QEMU block layer.
> >>>
> >>> Khoa Huynh <[email protected]> reported an increase from 140,000 IOPS
> >>> to 600,000 IOPS for a single VM using virtio-blk-data-plane in July:
> >>>
> >>> http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> >>>
> >>> The virtio-blk-data-plane approach was originally presented at Linux
> >>> Plumbers Conference 2010. The following slides contain a brief
> >>> overview:
> >>>
> >>> http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
> >>>
> >>> The basic approach is:
> >>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >>>    signalling when the guest kicks the virtqueue.
> >>> 2. Requests are processed without going through the QEMU block layer,
> >>>    using Linux AIO directly.
> >>> 3. Completion interrupts are injected via irqfd from the dedicated
> >>>    thread.
> >>>
> >>> To try it out:
> >>>
> >>> qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=... \
> >>>      -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
> >>
> >> Is this the latest dataplane bits:
> >> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
> >>
> >> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> >> Author: Stefan Hajnoczi <[email protected]>
> >> Date:   Wed Nov 14 15:45:38 2012 +0100
> >>
> >>     virtio-blk: add x-data-plane=on|off performance feature
> >>
> >> With this commit on a ramdisk-based box, I am seeing about 10K IOPS with
> >> x-data-plane on and 90K IOPS with x-data-plane off.
> >>
> >> Any ideas?
> >>
> >> Command line I used:
> >>
> >> IMG=/dev/ram0
> >> x86_64-softmmu/qemu-system-x86_64 \
> >>     -drive file=/root/img/sid.img,if=ide \
> >>     -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 \
> >>     -device virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> >>     -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> >>     -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 \
> >>     -enable-kvm -m 2048 -smp 4 -cpu qemu64,+x2apic -M pc
> >
> > I was just about to send out the latest patch series, which addresses
> > review comments, so I have tested the latest code
> > (61b70fef489ce51ecd18d69afb9622c110b9315c).
> >
> > I was unable to reproduce a ramdisk performance regression on Linux
> > 3.6.6-3.fc18.x86_64 with an Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
> > and 8 GB RAM.
>
> I am using the latest upstream kernel.
>
> > The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3
> > guest.
> >
> > Summary results:
> > x-data-plane-on:  iops=132856 aggrb=1039.1MB/s
> > x-data-plane-off: iops=126236 aggrb=988.40MB/s
> >
> > virtio-blk-data-plane is ~5% faster in this benchmark.
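The summary numbers above can be cross-checked: with the job's 8k block size, aggregate bandwidth should be roughly IOPS times block size (fio's "MB/s" here is KiB-based). The following few lines of Python are an editorial sanity check, not part of the original thread; they confirm the two runs are self-consistent and put the data-plane gain at about 5%:

```python
# Sanity-check the fio summary: derived bandwidth = IOPS * block size.
# fio's "MB/s" is KiB-based (1 MB/s = 1024 KiB/s); the job used 8 KiB blocks.
runs = {
    "x-data-plane-on":  (132856, 1039.1),   # (iops, reported aggrb in MB/s)
    "x-data-plane-off": (126236, 988.40),
}

for name, (iops, reported) in runs.items():
    derived = iops * 8 / 1024               # 8 KiB per I/O -> MB/s
    print(f"{name}: derived {derived:.1f} MB/s vs reported {reported} MB/s")

on_iops = runs["x-data-plane-on"][0]
off_iops = runs["x-data-plane-off"][0]
print(f"speedup: {(on_iops / off_iops - 1) * 100:.1f}%")   # ~5%, as claimed
```

Derived and reported bandwidths agree to within a fraction of a percent, consistent with the "~5% faster" conclusion.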
> >
> > fio jobfile:
> >
> > [global]
> > filename=/dev/vda
> > blocksize=8k
> > ioengine=libaio
> > direct=1
> > iodepth=8
> > runtime=120
> > time_based=1
> >
> > [reads]
> > readwrite=randread
> > numjobs=4
> >
> > Perf top (data-plane-on):
> >   3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
> >   3.27%  [kernel]            [k] memset    <--- ramdisk
> >   2.98%  [kernel]            [k] do_blockdev_direct_IO
> >   2.82%  [kvm_intel]         [k] vmx_vcpu_run
> >   2.66%  [kernel]            [k] _raw_spin_lock_irqsave
> >   2.06%  [kernel]            [k] put_compound_page
> >   2.06%  [kernel]            [k] __get_page_tail
> >   1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
> >   1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >   1.33%  qemu-system-x86_64  [.] vring_pop <--- virtio-blk-data-plane
> >   1.19%  [kernel]            [k] compound_unlock_irqrestore
> >   1.13%  [kernel]            [k] gup_huge_pmd
> >   1.11%  [kernel]            [k] __audit_syscall_exit
> >   1.07%  [kernel]            [k] put_page_testzero
> >   1.01%  [kernel]            [k] fget
> >   1.01%  [kernel]            [k] do_io_submit
> >
> > Since the ramdisk (memset and page-related functions) is so prominent
> > in perf top, I also tried a 1-job 8k dd sequential write test on a
> > Samsung 830 Series SSD, where virtio-blk-data-plane was 9% faster than
> > virtio-blk. Optimizing against a ramdisk isn't a good idea IMO because
> > it acts very differently from real hardware, where the driver relies on
> > MMIO, DMA, and interrupts (vs. synchronous memcpy/memset).
>
> For the memset in the ramdisk, you can simply patch drivers/block/brd.c
> to do a nop instead of the memset for testing.
>
> Yes, if you have a fast SSD (sometimes you need multiple, which I do not
> have), it makes more sense to test on real hardware. However, a ramdisk
> test is still useful: it gives rough performance numbers. If A and B are
> both tested against a ramdisk, the difference between A and B is still
> meaningful.
Optimizing the difference between A and B on a ramdisk is only guaranteed
to optimize the ramdisk case. On real hardware the bottleneck might be
elsewhere, and we'd be chasing the wrong lead. I don't think it's a waste
of time, but to stay healthy I think we need to focus on real disks and
SSDs most of the time.

Stefan
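For readers new to the design, the three-step dedicated-thread model from the cover letter (block on the ioeventfd when the guest kicks, process the request, inject a completion via irqfd) can be sketched in miniature. This is a toy editorial illustration, not QEMU code: pipes stand in for the two eventfds, a `Queue` stands in for the virtqueue rings, and string-uppercasing stands in for the actual Linux AIO submission.

```python
import os
import queue
import threading

# Toy model of the virtio-blk-data-plane loop: a per-device thread waits
# for "kicks", services requests, and signals completions. Real QEMU uses
# eventfd(2), Linux AIO (io_submit/io_getevents), and KVM irqfd injection.

kick_r, kick_w = os.pipe()   # "ioeventfd": guest -> data-plane thread
irq_r, irq_w = os.pipe()     # "irqfd": data-plane thread -> guest
requests = queue.Queue()     # stand-in for the virtqueue request ring
completions = queue.Queue()  # stand-in for the used ring

def dataplane_thread():
    # Dedicated thread: runs outside the global mutex, never blocks a vcpu.
    while True:
        os.read(kick_r, 1)            # 1. block until the guest kicks
        req = requests.get()
        if req is None:               # shutdown sentinel
            return
        completions.put(req.upper())  # 2. pretend-I/O (really: Linux AIO)
        os.write(irq_w, b"\x01")      # 3. inject the completion "interrupt"

t = threading.Thread(target=dataplane_thread, daemon=True)
t.start()

# "Guest" side: queue a request into the ring, then kick the device.
requests.put("read sector 0")
os.write(kick_w, b"\x01")

os.read(irq_r, 1)                     # vcpu sees the interrupt...
completed = completions.get()         # ...and reaps the completion
print(completed)                      # -> READ SECTOR 0

requests.put(None)                    # tear down the toy device
os.write(kick_w, b"\x01")
t.join()
```

The benchmarks in this thread probe exactly this structure: because step 2 bypasses the QEMU block layer and the loop never takes the global mutex, vcpu threads stay runnable while I/O is in flight.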
