Patches submitted to kernel-team mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-December/155853.html.

SRU Justification

[Impact]

The patch "vfio/pci: Use unmap_mapping_range()" rewrote the way VFIO tracks
mapped regions to use the "vmf_insert_pfn" function instead of tracking them
itself and using "io_remap_pfn_range". The implementation using
"vmf_insert_pfn" is significantly slower. To mitigate this slowdown, "vfio/pci:
Insert full vma on mmap'd MMIO fault" was introduced to prefault the entirety
of areas mapped by vfio_pci, resulting in soft lockup warnings on the host for
large BAR region devices. Reverting this prefaulting behavior does not fully
resolve the slowness, as a VM still experiences extremely slow accesses to the
passthrough devices as VMAs get faulted in, causing soft lockup warnings in the
guest during boot. Thus, "vfio/pci: Use unmap_mapping_range()" must also be
reverted to restore performance to that of versions prior to 6.8.0-48-generic.
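
To illustrate the mechanism, below is a simplified, hypothetical sketch (not
the actual vfio_pci code; only vmf_insert_pfn(), io_remap_pfn_range() and the
related mm types are real kernel APIs, all other names are illustrative) of a
fault handler that prefaults an entire VMA one page at a time, contrasted
with establishing the whole mapping up front at mmap() time:

#include <linux/mm.h>

/*
 * Sketch of the prefaulting approach: on the first fault, insert a PTE for
 * every page of the VMA rather than only the faulting page. For a multi-GiB
 * BAR this means millions of vmf_insert_pfn() calls in one uninterrupted
 * loop, which is what can trip the host's soft lockup detector.
 */
static vm_fault_t example_mmio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long base_pfn = vma->vm_pgoff;	/* illustrative: BAR start pfn */
	unsigned long addr, pfn;
	vm_fault_t ret = VM_FAULT_NOPAGE;

	for (addr = vma->vm_start, pfn = base_pfn;
	     addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, addr, pfn);
		if (ret & VM_FAULT_ERROR)
			break;
	}

	return ret;
}

/*
 * Sketch of the pre-6.8.0-48 behaviour: the whole region is mapped once at
 * mmap() time with io_remap_pfn_range(), so no per-page fault path runs.
 */
static int example_mmio_mmap(struct vm_area_struct *vma, unsigned long base_pfn)
{
	return io_remap_pfn_range(vma, vma->vm_start, base_pfn,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}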

[Fix]

Both of these performance issues are resolved upstream by patchset [1], but
backporting it to 6.8 and 6.11 would be complex, as it makes significant
changes to core parts of the kernel.

Reverting the following commits resolves the issue, with a much reduced
potential for regression:
- "mm: use rwsem assertion macros for mmap_lock" (revert needed in Oracular,
  not present in Noble)
- "vfio/pci: Insert full vma on mmap'd MMIO fault"
- "vfio/pci: Use unmap_mapping_range()"

[Test Plan]

Tested on a DGX H100 system with 8 passthrough H100 GPUs: the reverts reduce
VM start time from 45 minutes back down to 5 minutes and eliminate the soft
lockup warnings.

Reproduced using a libvirt VM, created with:
        $ sudo virt-install --connect qemu:///system -v --name gpu-pt-test \
                --memory 16384 --vcpus 16 --cpu host --cdrom \
                /ubuntu-24.04.1-live-server-amd64.iso --os-variant ubuntu24.04 \
                --disk size=512 -w bridge=virbr0 --graphics none \
                --console pty,target.type=virtio \
                --hostdev pci_0000_1b_00_0 --hostdev pci_0000_43_00_0 \
                --hostdev pci_0000_52_00_0 --hostdev pci_0000_61_00_0 \
                --hostdev pci_0000_9d_00_0 --hostdev pci_0000_d1_00_0 \
                --hostdev pci_0000_df_00_0 --hostdev pci_0000_c3_00_0

[Where problems could occur]

The reverts here primarily affect the vfio_pci driver, so any regression would
most likely show up as misbehavior of that driver. In Oracular, "mm: use rwsem
assertion macros for mmap_lock" is also reverted, which could additionally
allow mmap locking bugs to go undetected unless testing is done with lockdep
enabled.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=883517

** Changed in: linux-nvidia (Ubuntu Noble)
       Status: In Progress => Fix Committed

https://bugs.launchpad.net/bugs/2089306

Title:
  vfio_pci soft lockup on VM start while using PCIe passthrough

Status in linux package in Ubuntu:
  Invalid
Status in linux-nvidia package in Ubuntu:
  Invalid
Status in linux-nvidia-6.11 package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  In Progress
Status in linux-nvidia source package in Noble:
  Fix Committed
Status in linux-nvidia-6.11 source package in Noble:
  In Progress
Status in linux source package in Oracular:
  In Progress

Bug description:
  When starting a VM with a passthrough PCIe device, the vfio_pci driver
  will block while its fault handler pre-faults the entire mapped area.
  For PCIe devices with large BAR regions this takes a very long time to
  complete, and thus causes soft lockup warnings on the host. This
  process can take hours with multiple passthrough PCIe devices that
  have large BAR regions.

  This issue was introduced in kernel version 6.8.0-48-generic, with the
  addition of patches "vfio/pci: Use unmap_mapping_range()" and
  "vfio/pci: Insert full vma on mmap'd MMIO fault".

  The patch "vfio/pci: Use unmap_mapping_range()" rewrote the way VFIO
  tracks mapped regions to use the "vmf_insert_pfn" function instead of
  tracking them itself and using "io_remap_pfn_range". The
  implementation using "vmf_insert_pfn" is significantly slower.

  The patch "vfio/pci: Insert full vma on mmap'd MMIO fault" introduced
  this pre-faulting behavior, causing soft lockup warnings on the host
  while the VM launches.

  Without "vfio/pci: Insert full vma on mmap'd MMIO fault", a guest OS
  experiences significantly longer boot times as faults are generated
  while configuring the passthrough PCIe devices, but the host does not
  see soft lockup warnings.

  Both of these performance issues are resolved upstream by patchset
  [1], but this would be a complex backport to 6.8, with significant
  changes to core parts of the kernel.

  The "vfio/pci: Use unmap_mapping_range()" patch was introduced as part
  of patchset [2], and is intended to resolve a WARN_ON splat introduced
  by the upstream patch ba168b52bf8e ("mm: use rwsem assertion macros
  for mmap_lock"). However, this mmap_lock patch is not present in
  noble:linux, and hence noble:linux was never impacted by the WARN_ON
  issue.

  Thus, we can safely revert the following patches to resolve this VFIO
  slowdown:
  - "vfio/pci: Insert full vma on mmap'd MMIO fault"
  - "vfio/pci: Use unmap_mapping_range()"

  [1] https://patchwork.kernel.org/project/linux-mm/list/?series=883517
  [2] https://lore.kernel.org/all/20240530045236.1005864-3-alex.william...@redhat.com/
