I finally got around to reading the Linux multiqueue block layer paper and wanted to share some thoughts about how it relates to QEMU and dataplane/QContext: http://kernel.dk/blk-mq.pdf
I think Jens has virtio-blk multiqueue patches, so let's imagine that the virtio-blk device has multiple virtqueues. (virtio-scsi is already multiqueue, BTW.)

The paper focuses on two queue mappings: 1 queue per core and 1 queue per NUMA node. In both cases the idea is to keep the block I/O code path localized, so that block I/O scales as the number of CPUs increases.

In QEMU we'd want to set up the equivalent mapping for the multiqueue virtio-blk device: each guest vcpu or guest node gets a virtio-blk virtqueue which is serviced by a dataplane/QContext thread. QEMU would then process requests across these queues in parallel, although currently BlockDriverState is not thread-safe. At least for raw we should be able to submit requests in parallel from QEMU.

Unfortunately there are some complications in the QEMU block layer: QEMU's own accounting, request tracking, and throttling features are global. We'd eventually need to do something similar to the multiqueue block layer changes in the kernel to disentangle this state.

Doing multiqueue for image formats is much more challenging - we'd have to tackle thread-safety in qcow2 and friends. For network block drivers like Gluster or NBD it's also not 100% clear what the best approach is. But I think the target here is local SSDs capable of high IOPS together with an SMP guest.

At the end of all this we'd arrive at the following architecture:

1. The guest virtio device has multiple queues (1 per node or vcpu).

2. QEMU has multiple dataplane/QContext threads that process virtqueue kicks; they are bound to host CPUs/nodes (see the sketch at the end of this mail).

3. The Linux kernel has multiqueue block I/O.

Jens: when experimenting with multiqueue virtio-blk, how far did you modify QEMU to eliminate global request processing state from block.c?

Stefan
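
P.S. To make item 2 above concrete, here is a minimal standalone sketch (plain pthreads, not actual QEMU or dataplane code) of the per-queue threading model: one worker thread per virtqueue, pinned to a host CPU and keeping its own request count instead of touching global state. The queue_ctx/queue_worker names, the NUM_QUEUES value, and the trivial 1:1 queue-to-CPU mapping are made up purely for illustration.

/* Sketch only: one worker thread per queue, each pinned to a host CPU
 * and keeping per-queue statistics, so nothing is shared between queues.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_QUEUES 4   /* e.g. one virtqueue per guest vcpu */

struct queue_ctx {
    unsigned int index;          /* queue/vcpu index */
    unsigned int host_cpu;       /* host CPU this worker is bound to */
    unsigned long long nr_reqs;  /* per-queue accounting, no global lock */
    pthread_t thread;
};

static void *queue_worker(void *opaque)
{
    struct queue_ctx *q = opaque;
    cpu_set_t set;

    /* Bind this worker to its host CPU to keep the I/O path local. */
    CPU_ZERO(&set);
    CPU_SET(q->host_cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Placeholder for the real work: pop requests from this queue's
     * virtqueue, submit them, and account for them locally. */
    q->nr_reqs++;
    return NULL;
}

int main(void)
{
    struct queue_ctx queues[NUM_QUEUES];
    unsigned int i;

    for (i = 0; i < NUM_QUEUES; i++) {
        queues[i].index = i;
        queues[i].host_cpu = i;  /* trivial 1:1 queue-to-CPU mapping */
        queues[i].nr_reqs = 0;
        pthread_create(&queues[i].thread, NULL, queue_worker, &queues[i]);
    }
    for (i = 0; i < NUM_QUEUES; i++) {
        pthread_join(queues[i].thread, NULL);
        printf("queue %u: %llu requests\n",
               queues[i].index, queues[i].nr_reqs);
    }
    return 0;
}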