[Kernel-packages] [Bug 2089306] Re: vfio_pci soft lockup on VM start while using PCIe passthrough

Jacob Martin Thu, 21 Nov 2024 11:56:43 -0800

Soft lockup warnings on host while VM launches with neither patch reverted:
[ 3575.401356] watchdog: BUG: soft lockup - CPU#214 stuck for 26s! [CPU 23/KVM:16612]
[ 3575.409882] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack 
ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat vhost_net tap vfio_pci 
vfio_pci_core vfio_iommu_type1 vfio iommufd nvidia_uvm(OE) nvidia_drm(OE) 
nvidia_modeset(OE) nft_masq nft_chain_nat bridge stp llc zfs(PO) spl(O) 
nvme_tcp nvme_keyring nvme_fabrics ebtable_filter ebtables ip6table_raw 
ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_raw 
iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
iptable_filter nf_tables intel_rapl_msr intel_rapl_common 
intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit 
x86_pkg_temp_thermal intel_powerclamp vhost_vsock 
vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock qrtr coretemp 
kvm_intel kvm irqbypass iaa_crypto rapl cfg80211 binfmt_misc cmdlinepart 
spi_nor pmt_telemetry mtd pmt_class intel_sdsi dax_hmem qat_4xxx intel_cstate 
cxl_acpi cxl_core nvidia(OE) nls_iso8859_1 ast intel_qat isst_if_mbox_pci 
isst_if_mmio idxd video crc8 isst_if_common
[ 3575.409925]  ecc i2c_algo_bit intel_vsec intel_th_gth authenc idxd_bus 
mei_me i2c_i801 intel_th_pci spi_intel_pci switchtec mei intel_th i2c_smbus 
i2c_ismt spi_intel ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler 
mac_hid sch_fq_codel dm_multipath msr efi_pstore nfnetlink dmi_sysfs ip_tables 
x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
mlx5_ib(OE) ib_uverbs(OE) macsec ib_core(OE) mlx5_core(OE) mlxfw(OE) 
crct10dif_pclmul psample crc32_pclmul polyval_clmulni mlxdevm(OE) 
polyval_generic ixgbe ghash_clmulni_intel tls ice nvme xfrm_algo sha256_ssse3 
sha1_ssse3 mlx_compat(OE) xhci_pci dca gnss nvme_core pci_hyperv_intf 
xhci_pci_renesas mdio nvme_auth wmi pinctrl_emmitsburg aesni_intel crypto_simd 
cryptd
[ 3575.409965] CPU: 214 PID: 16612 Comm: CPU 23/KVM Tainted: P           OEL     6.8.0-49-generic #49-Ubuntu
[ 3575.409967] Hardware name: NVIDIA DGXH100/DGXH100, BIOS 1.1.3 10/30/2023
[ 3575.409968] RIP: 0010:find_next_iomem_res+0x4f/0x160
[ 3575.409976] Code: fc 48 c7 c7 50 5e 3d 8e 53 48 89 f3 e8 1a a9 13 01 48 8b 
05 fb 9a 37 02 48 85 c0 0f 84 c8 00 00 00 48 3b 18 0f 82 bf 00 00 00 <4c> 39 60 
08 0f 82 86 00 00 00 48 8b 50 18 4c 21 ea 4c 39 ea 75 7a
[ 3575.409978] RSP: 0018:ff433e90f4c274c0 EFLAGS: 00000206
[ 3575.409979] RAX: ff3377d798b31b60 RBX: 000021b4bbb23fff RCX: 0000000000000000
[ 3575.409980] RDX: 0000000081000200 RSI: ff3377d798b31b60 RDI: 0000000000000000
[ 3575.409981] RBP: ff433e90f4c274e8 R08: ff433e90f4c27508 R09: 0000000000000000
[ 3575.409982] R10: 0000000000000000 R11: 0000000000000000 R12: 000021b4bbb23000
[ 3575.409982] R13: 0000000081000200 R14: ff433e90f4c27508 R15: 0000000000000000
[ 3575.409983] FS:  0000796f86a006c0(0000) GS:ff3379d2ff900000(0000) knlGS:0000000000000000
[ 3575.409984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3575.409985] CR2: 0000000000000000 CR3: 000001017e45a003 CR4: 0000000000f73ef0
[ 3575.409986] PKRU: 55555554
[ 3575.409987] Call Trace:
[ 3575.409988]  <IRQ>
[ 3575.409991]  ? show_regs+0x6d/0x80
[ 3575.409995]  ? watchdog_timer_fn+0x206/0x290
[ 3575.409999]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 3575.410001]  ? __hrtimer_run_queues+0x10f/0x2a0
[ 3575.410005]  ? clockevents_program_event+0xbe/0x150
[ 3575.410008]  ? hrtimer_interrupt+0xf6/0x250
[ 3575.410010]  ? __sysvec_apic_timer_interrupt+0x4e/0x150
[ 3575.410014]  ? sysvec_apic_timer_interrupt+0x8d/0xd0
[ 3575.410020]  </IRQ>
[ 3575.410021]  <TASK>
[ 3575.410021]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 3575.410027]  ? find_next_iomem_res+0x4f/0x160
[ 3575.410029]  walk_system_ram_range+0x72/0x110
[ 3575.410032]  ? __pfx_pagerange_is_ram_callback+0x10/0x10
[ 3575.410036]  pat_pagerange_is_ram+0x7a/0xa0
[ 3575.410038]  lookup_memtype+0x4d/0xf0
[ 3575.410040]  track_pfn_insert+0x34/0x60
[ 3575.410042]  vmf_insert_pfn_prot+0x82/0x100
[ 3575.410047]  vmf_insert_pfn+0x12/0x20
[ 3575.410050]  vfio_pci_mmap_fault+0xda/0x140 [vfio_pci_core]
[ 3575.410057]  __do_fault+0x38/0x140
[ 3575.410059]  do_fault+0x94/0x350
[ 3575.410060]  handle_pte_fault+0x114/0x1d0
[ 3575.410062]  __handle_mm_fault+0x653/0x790
[ 3575.410065]  handle_mm_fault+0x18a/0x380
[ 3575.410066]  fixup_user_fault+0x85/0x1d0
[ 3575.410070]  follow_fault_pfn+0x76/0x170 [vfio_iommu_type1]
[ 3575.410074]  vaddr_get_pfns+0x9c/0x1c0 [vfio_iommu_type1]
[ 3575.410078]  vfio_pin_pages_remote+0x2e5/0x4a0 [vfio_iommu_type1]
[ 3575.410081]  vfio_pin_map_dma+0xd1/0x310 [vfio_iommu_type1]
[ 3575.410085]  vfio_dma_do_map+0x349/0x500 [vfio_iommu_type1]
[ 3575.410089]  vfio_iommu_type1_ioctl+0x158/0x1e0 [vfio_iommu_type1]
[ 3575.410092]  vfio_fops_unl_ioctl+0x5f/0x1a0 [vfio]
[ 3575.410100]  __x64_sys_ioctl+0xa0/0xf0
[ 3575.410104]  x64_sys_call+0x143b/0x25c0
[ 3575.410107]  do_syscall_64+0x7f/0x180
[ 3575.410112]  ? __kvm_set_memory_region.part.0+0x39c/0x660 [kvm]
[ 3575.410164]  ? interval_tree_insert+0x6d/0xd0
[ 3575.410169]  ? __kvm_set_memory_region+0x35/0x60 [kvm]
[ 3575.410197]  ? kvm_vm_ioctl+0x933/0xb70 [kvm]
[ 3575.410227]  ? rseq_get_rseq_cs+0x22/0x280
[ 3575.410232]  ? rseq_ip_fixup+0x90/0x1f0
[ 3575.410234]  ? syscall_exit_to_user_mode+0x86/0x260
[ 3575.410236]  ? do_syscall_64+0x8c/0x180
[ 3575.410237]  ? rseq_ip_fixup+0x90/0x1f0
[ 3575.410239]  ? syscall_exit_to_user_mode+0x86/0x260
[ 3575.410240]  ? do_syscall_64+0x8c/0x180
[ 3575.410242]  ? syscall_exit_to_user_mode+0x86/0x260
[ 3575.410243]  ? do_syscall_64+0x8c/0x180
[ 3575.410245]  ? irqentry_exit+0x43/0x50
[ 3575.410246]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 3575.410247] RIP: 0033:0x7971b7324ded
[ 3575.410291] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 
10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 
00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 3575.410292] RSP: 002b:0000796f869fed90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3575.410293] RAX: ffffffffffffffda RBX: 00006410a32c6b50 RCX: 00007971b7324ded
[ 3575.410294] RDX: 0000796f869fedf0 RSI: 0000000000003b71 RDI: 000000000000004f
[ 3575.410294] RBP: 0000796f869fede0 R08: 0000000000000000 R09: 0000796f105f7d10
[ 3575.410295] R10: 0000382000000000 R11: 0000000000000246 R12: 0000380000000000
[ 3575.410296] R13: 0000002000000000 R14: 0000796f869feeb8 R15: 0000796f869fedf0
[ 3575.410297]  </TASK>
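
To check how often the watchdog fires during a VM launch, the warning lines can be pulled out of a saved dmesg capture. The following is a minimal sketch, assuming the log uses the same "watchdog: BUG: soft lockup" line format as the trace above; the function name find_soft_lockups is illustrative and not part of any existing tool. It tolerates the line wrapping that mail clients add inside the bracketed task name.

```python
import re

# Matches the watchdog line format seen in the trace above, e.g.
# "watchdog: BUG: soft lockup - CPU#214 stuck for 26s! [CPU 23/KVM:16612]".
# [^\]]+ also matches newlines, so a task name wrapped by a mail client
# is still captured.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+)\s+stuck for\s+"
    r"(?P<secs>\d+)s!\s+\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup warning found in log_text."""
    events = []
    for match in SOFT_LOCKUP.finditer(log_text):
        events.append({
            "cpu": int(match["cpu"]),
            "seconds": int(match["secs"]),
            # Collapse any wrapped whitespace inside the task name.
            "comm": " ".join(match["comm"].split()),
            "pid": int(match["pid"]),
        })
    return events
```

Called on the text of a log file, for example find_soft_lockups(open("dmesg.txt").read()) with a hypothetical file name, it returns the stuck CPU, the reported duration, and the task name and PID for each warning.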


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2089306

Title:
  vfio_pci soft lockup on VM start while using PCIe passthrough

Status in linux package in Ubuntu:
  Invalid
Status in linux-nvidia package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  In Progress
Status in linux-nvidia source package in Noble:
  In Progress

Bug description:
  When starting a VM with a passthrough PCIe device, the vfio_pci driver
  blocks while its fault handler pre-faults the entire mapped area. For
  PCIe devices with large BAR regions this takes long enough to trigger
  soft lockup warnings on the host. With multiple large-BAR devices
  passed through, the process can take hours.
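
  To get a feel for the scale involved, the sketch below computes BAR
  sizes from the contents of a device's sysfs "resource" file
  (/sys/bus/pci/devices/<BDF>/resource, one "start end flags" line per
  region) and converts them to page counts. Since vmf_insert_pfn
  inserts a single PFN per call, the pre-fault work grows with the
  number of pages. The 4 KiB page size, the helper names, and the
  sample addresses in the usage note are assumptions for illustration,
  not values taken from the affected system.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages on the host


def bar_sizes(resource_text):
    """Return {region index: size in bytes} for the non-empty regions.

    resource_text is the content of a PCI device's sysfs "resource"
    file: one line per region with start, end and flags in hex.
    """
    sizes = {}
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        start, end = int(fields[0], 16), int(fields[1], 16)
        if start == 0 and end == 0:
            continue  # unused region
        sizes[index] = end - start + 1  # end address is inclusive
    return sizes


def pages_in(size_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to cover size_bytes (ceiling division)."""
    return -(-size_bytes // page_size)
```

  For a hypothetical region line "0x0000380000000000 0x0000381fffffffff
  0x000000000014220c", bar_sizes reports 137438953472 bytes (128 GiB),
  and pages_in gives 33554432 pages, each of which would need its own
  PFN insertion under these assumptions.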

  This issue was introduced in kernel version 6.8.0-48-generic, with the
  addition of patches "vfio/pci: Use unmap_mapping_range()" and
  "vfio/pci: Insert full vma on mmap'd MMIO fault".

  The patch "vfio/pci: Use unmap_mapping_range()" rewrote the way VFIO
  tracks mapped regions to use the "vmf_insert_pfn" function instead of
  tracking them itself and using "io_remap_pfn_range". The
  implementation using "vmf_insert_pfn" is significantly slower.

  The patch "vfio/pci: Insert full vma on mmap'd MMIO fault" introduced
  this pre-faulting behavior, causing soft lockup warnings on the host
  while the VM launches.

  Without "vfio/pci: Insert full vma on mmap'd MMIO fault", a guest OS
  experiences significantly longer boot times as faults are generated
  while configuring the passthrough PCIe devices, but the host does not
  see soft lockup warnings.

  Both of these performance issues are resolved upstream by patchset
  [1], but this would be a complex backport to 6.8, with significant
  changes to core parts of the kernel.

  The "vfio/pci: Use unmap_mapping_range()" patch was introduced as part
  of patchset [2], and is intended to resolve a WARN_ON splat introduced
  by the upstream patch ba168b52bf8e ("mm: use rwsem assertion macros
  for mmap_lock"). However, this mmap_lock patch is not present in
  noble:linux, and hence noble:linux was never impacted by the WARN_ON
  issue.

  Thus, we can safely revert the following patches to resolve this VFIO
  slowdown:
  - "vfio/pci: Insert full vma on mmap'd MMIO fault"
  - "vfio/pci: Use unmap_mapping_range()"

  [1] https://patchwork.kernel.org/project/linux-mm/list/?series=883517
  [2] https://lore.kernel.org/all/20240530045236.1005864-3-alex.william...@redhat.com/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2089306/+subscriptions


