Agreed on Frank's last comment.

commit a32295c612c57990d17fb0f41e7134394b2f35f6
Author: Alexey Kardashevskiy <a...@ozlabs.ru>
Date:   Wed Dec 13 00:31:31 2017

    vfio-pci: Allow mapping MSIX BAR

    By default VFIO disables mapping of MSIX BAR to the userspace as
    the userspace may program it in a way allowing spurious interrupts;
    instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
    In order to eliminate guessing from the userspace about what is
    mmapable, VFIO also advertises a sparse list of regions allowed to
    mmap.

    This works fine as long as the system page size equals the MSIX
    alignment requirement, which is 4KB. However with a bigger page
    size the existing code prohibits mapping non-MSIX parts of a page
    with MSIX structures, so these parts have to be emulated via slow
    reads/writes on a VFIO device fd. If these emulated bits are
    accessed often, this has serious impact on performance.

    This allows mmap of the entire BAR containing the MSIX vector
    table.

    This removes the sparse capability for PCI devices as it becomes
    useless.

    As the userspace needs to know for sure whether mmapping of the
    MSIX vector containing data can succeed, this adds a new capability
    - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the
    userspace that the entire BAR can be mmapped.

    This does not touch the MSIX mangling in the BAR read/write
    handlers as we are doing this just to enable direct access to
    non-MSIX registers.

    Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
    [aw - fixup whitespace, trim function name]
    Signed-off-by: Alex Williamson <alex.william...@redhat.com>
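For context on how userspace consumes that new capability: it walks the
region-info capability chain returned by VFIO_DEVICE_GET_REGION_INFO.
A minimal sketch (untested; the device fd is assumed to come from the
usual VFIO container/group setup, and bar_is_msix_mappable/bar_index
are made-up names, not from the patch):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Return 1 if the whole BAR (MSIX vector table included) may be
 * mmapped, 0 if not, -1 on error. */
static int bar_is_msix_mappable(int device_fd, __u32 bar_index)
{
    struct vfio_region_info probe = {
        .argsz = sizeof(probe),
        .index = bar_index,
    };
    struct vfio_region_info *info;
    struct vfio_info_cap_header *hdr;
    __u32 off;
    int ret = 0;

    /* First call just learns the real argsz and the CAPS flag. */
    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &probe))
        return -1;
    if (!(probe.flags & VFIO_REGION_INFO_FLAG_CAPS))
        return 0;

    /* Second call with room for the capability chain appended. */
    info = calloc(1, probe.argsz);
    if (!info)
        return -1;
    info->argsz = probe.argsz;
    info->index = bar_index;
    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info)) {
        free(info);
        return -1;
    }

    /* Chain offsets are relative to the start of *info; next == 0
     * terminates the chain. */
    for (off = info->cap_offset; off; off = hdr->next) {
        hdr = (struct vfio_info_cap_header *)((char *)info + off);
        if (hdr->id == VFIO_REGION_INFO_CAP_MSIX_MAPPABLE) {
            ret = 1;
            break;
        }
    }
    free(info);
    return ret;
}

QEMU performs the equivalent chain walk before deciding whether it can
mmap the whole BAR, which is what the backported qemu patches key off.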
Seems like a case for HWE kernels.

** Changed in: linux (Ubuntu Bionic)
       Status: New => Won't Fix

--
https://bugs.launchpad.net/bugs/1847948

Title:
  Improve NVMe guest performance on Bionic QEMU

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Won't Fix
Status in qemu source package in Bionic:
  Triaged

Bug description:
  [Impact]

   * In the past qemu has generally not allowed MSI-X BAR mapping on
     VFIO. But there are platforms (like ppc64 spapr) that can and
     want to do exactly that.

   * Backport two patches from upstream (in since qemu 2.12 / Disco).

   * These bring a tremendous speedup, especially useful with page
     sizes bigger than 4k: accesses no longer have to be split into
     chunks, and direct MMIO access becomes possible for the guest.

  [Test Case]

   * On ppc64, pass through an NVMe device to the guest and run I/O
     benchmarks; see below for details on how to set that up.
     Note: this needs the HWE kernel or another kernel carrying the
     fix [1].
     Note: the test should also be done with the non-HWE kernel; the
     expectation there is that it would not show the perf benefits,
     but still work fine.

  [Regression Potential]

   * Changes:
     a) if the host driver allows mapping of MSI-X data, the entire
        BAR is mapped. This is only done if the kernel reports that
        capability [1], which ensures that qemu exposes the new
        behavior only on kernels able to support it (safe against
        regression in that regard).
     b) on ppc64, MSI-X emulation is disabled for VFIO devices. This
        is local to just this HW and will not affect other HW.
     Generally, the regressions that come to mind are slight changes
     in behavior (real HW vs. the former emulation) that could cause
     trouble on some weird/old guests. But that is limited to PPC,
     where only a small set of certified HW is really allowed. The
     mapping that might be added even on other platforms should not
     consume too much extra memory as long as it isn't used.
     Further, since it depends on the kernel capability, the mmap is
     not issued on kernels where we expect it to fail. So while it is
     quite a change, it seems safe to me.
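  (As an illustration of change (a), a rough sketch of what mapping
  the whole BAR amounts to once the capability above is present; this
  is not code from the patches, and map_whole_bar/bar_index are
  made-up names:

  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static void *map_whole_bar(int device_fd, __u32 bar_index)
  {
      struct vfio_region_info info = {
          .argsz = sizeof(info),
          .index = bar_index,
      };

      if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
          return MAP_FAILED;
      if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
          return MAP_FAILED;

      /* info.offset is the cookie VFIO expects as the mmap offset;
       * without MSIX_MAPPABLE, older kernels only allow the sparse
       * areas around the vector table to be mapped. */
      return mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                  device_fd, info.offset);
  }

  Accesses to the non-MSIX registers in that BAR then go straight to
  the device instead of being trapped and emulated through the region
  fd, which is where the fio numbers below come from.)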
  [Other Info]

   * I know, one could just as well call this a "feature", but it
     really is a performance bug fix more than anything else. Also,
     the SRU policy allows exploitation/toleration of new HW,
     especially for LTS releases. Therefore I think this is fine as
     an SRU.

  [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

  == Comment: #0 - Murilo Opsfelder Araujo - 2019-10-11 14:16:14 ==

  ---Problem Description---
  Back-port the following patches to Bionic QEMU to improve NVMe
  guest performance by more than 200%:

  "vfio-pci: Allow mmap of MSIX BAR"
  https://git.qemu.org/?p=qemu.git;a=commit;h=ae0215b2bb56a9d5321a185dde133bfdd306a4c0

  "ppc/spapr, vfio: Turn off MSIX emulation for VFIO devices"
  https://git.qemu.org/?p=qemu.git;a=commit;h=fcad0d2121976df4b422b4007a5eb7fcaac01134

  ---uname output---
  na

  ---Additional Hardware Info---
  0030:01:00.0 Non-Volatile memory controller: Samsung Electronics
  Co Ltd NVMe SSD Controller 172Xa/172Xb (rev 01)
  Machine Type = AC922

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
  Install or set up a guest image and boot it.

  Once the guest is running, pass through the NVMe disk to the guest
  using the following XML:

  host$ cat nvme-disk.xml
  <hostdev mode='subsystem' type='pci' managed='no'>
    <driver name='vfio'/>
    <source>
      <address domain='0x0030' bus='0x01' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

  host$ virsh attach-device <domain> nvme-disk.xml --live

  On the guest, run fio benchmarks:

  guest$ fio --direct=1 --rw=randrw --refill_buffers --norandommap \
      --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 \
      --iodepth=16 --runtime=60 --name=job1 --filename=/dev/nvme0n1 \
      --numjobs=4

  Without the patches, results are similar for numjobs=4 and
  numjobs=64, respectively:

  READ: bw=385MiB/s (404MB/s), 78.0MiB/s-115MiB/s (81.8MB/s-120MB/s), io=11.3GiB (12.1GB), run=30001-30001msec
  READ: bw=382MiB/s (400MB/s), 2684KiB/s-12.6MiB/s (2749kB/s-13.2MB/s), io=11.2GiB (12.0GB), run=30001-30009msec

  With the two patches applied, performance improved significantly
  for the numjobs=4 and numjobs=64 cases, respectively:

  READ: bw=1191MiB/s (1249MB/s), 285MiB/s-309MiB/s (299MB/s-324MB/s), io=34.9GiB (37.5GB), run=30001-30001msec
  READ: bw=4273MiB/s (4481MB/s), 49.7MiB/s-113MiB/s (52.1MB/s-119MB/s), io=125GiB (134GB), run=30001-30005msec

  Userspace tool common name: qemu
  Userspace rpm: qemu
  The userspace tool has the following bit modes: 64-bit
  Userspace tool obtained from project website: na

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1847948/+subscriptions