jammy-proposed kernel verified to fix bug on DGX H100:

I am seeing the boot time improvement on 2/2 tests (the time between
virsh start finishing and the boot reaching Host and Network Name
Lookups is down to 1:30, from 2:30 on the old kernel), the GPU driver
works, and the timing of the console output where the BARs are mapped
is consistent with the expected 'batching' behavior.

** Tags removed: verification-needed-jammy-linux
** Tags added: verification-done-jammy-linux

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2097389

Title:
  VM boots slowly with large-BAR GPU Passthrough due to pci/probe.c
  redundancy

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Noble:
  Fix Released
Status in linux source package in Oracular:
  Fix Released

Bug description:
  SRU Justification:

  [ Impact ]

  Without this patch, VM guests with large-BAR GPUs passed through to them
  take roughly twice as long to initialize all device BARs during boot.

  [ Test Plan ]

  I verified that this patch applies cleanly to the Noble kernel and
  resolves the bug on DGX H100 and DGX A100, and I observed no regressions.
  The fix can be verified on any machine with a GPU that has a sufficiently
  large BAR and the capability to pass it through to a VM using VFIO.
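
  For reference, a rough sketch of my own (not part of the original test
  plan) for checking whether a candidate GPU has a large BAR, by reading
  its sysfs resource file; the device address in the usage comment is a
  placeholder:

    /* Print the size of each standard PCI BAR of one device, read from
     * /sys/bus/pci/devices/<BDF>/resource (one "start end flags" line
     * per resource; the first six lines are BARs 0-5).
     * Usage: ./bar_sizes /sys/bus/pci/devices/0000:c1:00.0/resource */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        unsigned long long start, end, flags;
        int i = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s /sys/bus/pci/devices/<BDF>/resource\n",
                    argv[0]);
            return 1;
        }
        f = fopen(argv[1], "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        while (i < 6 && fscanf(f, "%llx %llx %llx", &start, &end, &flags) == 3) {
            if (end > start)    /* unused BARs read back as all zeroes */
                printf("BAR %d: %llu MiB\n", i, (end - start + 1) >> 20);
            i++;
        }
        fclose(f);
        return 0;
    }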

  To verify no regressions, I applied this patch to the guest kernel, then
  rebooted and confirmed that:
  1. The measured PCI initialization time on boot was ~50% of that of the
     unmodified kernel (a rough sketch of one way to extract this timing
     from dmesg follows this list)
  2. The relevant parts of the /proc/iomem mappings, the PCI init section
     of the dmesg output, and the lspci -vv output remained unchanged
     between the unmodified kernel and the patched kernel
  3. The NVIDIA driver still loaded successfully and was reported by
     nvidia-smi after the patch was applied
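
  A rough sketch of one way to obtain the measurement in item 1 (my own
  illustration, not part of the original verification; it assumes
  timestamped dmesg output and treats the first and last lines that
  mention "BAR" as the bounds of the PCI initialization window):

    /* Build in the guest: cc -O2 -o pci_init_time pci_init_time.c
     * Run:                dmesg > dmesg.txt && ./pci_init_time dmesg.txt */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        FILE *f = fopen(argc > 1 ? argv[1] : "dmesg.txt", "r");
        char line[4096];
        double first = -1.0, last = -1.0, ts;

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            /* dmesg lines look like "[   12.345678] pci 0000:01:00.0: BAR 1: ..." */
            if (sscanf(line, "[%lf]", &ts) != 1 || !strstr(line, "BAR"))
                continue;
            if (first < 0.0)
                first = ts;
            last = ts;
        }
        fclose(f);
        if (first < 0.0) {
            fprintf(stderr, "no timestamped BAR lines found\n");
            return 1;
        }
        printf("BAR setup window: %.3f s\n", last - first);
        return 0;
    }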

  [ Fix ]

  Roughly half of the time-consuming device configuration operations invoked
  during the PCI probe function can be eliminated by rearranging the memory
  and I/O decode disable/enable calls so that they occur once per device
  rather than once per BAR. This is what the upstream patch does, and it
  reliably eliminates roughly half of the excess initialization time during
  VM boot.
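
  The rearrangement can be sketched roughly as follows (my own illustration,
  not the literal upstream diff; probe_one_bar() is a hypothetical stand-in
  for the per-BAR sizing logic in drivers/pci/probe.c, while PCI_COMMAND and
  its decode bits are the real config-space definitions from <linux/pci.h>):

    #include <linux/pci.h>

    /* Stand-in for the real sizing code: write all-ones to the BAR, read
     * back the size mask, restore the original value. */
    static void probe_one_bar(struct pci_dev *dev, int bar) { }

    /* Old behavior: memory/IO decode is disabled and re-enabled around
     * every single BAR probe. */
    static void read_bases_per_bar(struct pci_dev *dev, int nbars)
    {
        int i;

        for (i = 0; i < nbars; i++) {
            u16 cmd;

            pci_read_config_word(dev, PCI_COMMAND, &cmd);
            if (cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
                pci_write_config_word(dev, PCI_COMMAND,
                        cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO));

            probe_one_bar(dev, i);

            if (cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
                pci_write_config_word(dev, PCI_COMMAND, cmd);
        }
    }

    /* Patched behavior: decode is toggled once per device.  With VFIO
     * passthrough every config-space write is trapped and emulated by the
     * host, so this cuts the expensive round trips from two per BAR to
     * two per device. */
    static void read_bases_per_device(struct pci_dev *dev, int nbars)
    {
        u16 cmd;
        int i;

        pci_read_config_word(dev, PCI_COMMAND, &cmd);
        if (cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
            pci_write_config_word(dev, PCI_COMMAND,
                    cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO));

        for (i = 0; i < nbars; i++)
            probe_one_bar(dev, i);

        if (cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
            pci_write_config_word(dev, PCI_COMMAND, cmd);
    }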

  [ Where problems could occur ]

  I do not expect any regressions. The only callers of ABIs changed by
  this patch are also adjusted within this patch, and the functional
  change only removes entirely redundant calls to disable/enable PCI
  memory/IO.

  [ Additional Context ]

  Upstream patch: https://lore.kernel.org/all/20250111210652.402845-1-alex.william...@redhat.com/
  Upstream bug report: https://lore.kernel.org/all/cahta-uyp07fgm6t1ozqkqadsa5jrzo0reneyzgqzub4mdrr...@mail.gmail.com/


