Public bug reported:

SRU Justification:

[ Impact ]

Due to an inefficiency in the way older host kernels manage pfnmaps for
guest VM memory ranges [1], guests with large-BAR GPUs passed through
have a very long (multiple minutes) initialization time when the MMIO
window advertised by OVMF is sized appropriately for the passed-through
BARs (i.e., the correct OVMF behavior).

We have already integrated a partial efficiency improvement [2], which is
transparent to the user on 6.8+ kernels, as well as an OVMF-based
approach that allows the user to force Jammy-like, faster boot speeds via
fw_ctl [3]. However, the patch series outlined in this report is the full
fix for the underlying cause of the issue on kernels that support huge
pfnmaps.

With this series [0] applied to both the host and guest of an impacted
system, BAR initialization times are reduced substantially: in the
commonly achieved optimal case, pfn lookups are reduced by a factor of
256K.  On a local test system, an overhead of ~1s for DMA mapping a 32GB
PCI BAR is reduced to sub-millisecond (8M page-sized operations reduced
to 32 PUD-sized operations).

[ Test Plan ]

On a machine with GPUs that have sufficiently large BARs:
1. Create a virtual machine with 4 GPUs passed through and CPU
   host-passthrough enabled. (We typically use a DGX H100 or A100.)
2. Observe that, on an unaltered 6.14 kernel, the VM boot time exceeds 5
   minutes.
3. After applying this series to both the host and guest kernels, boot the
   guest and observe that the VM boot time is under 30 seconds, with the
   BAR initialization steps completing significantly faster in the guest
   dmesg output.

[ Fix ]

This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12.  When pfnmaps are inserted using PUD and
PMD mappings, the mapping-level page mask is known, so later lookups can
iterate at the corresponding mapping stride rather than one base page at
a time.
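
As a rough, illustrative sketch only (not the code from the series [0];
mapping_stride_at() below is a hypothetical stand-in for however the real
code learns the mapping-level page mask), the shape of the optimization
looks like this in user-space C:

    #include <stdio.h>

    #define PAGE_SIZE (4ULL << 10)   /* 4 KiB */
    #define PMD_SIZE  (2ULL << 20)   /* 2 MiB */
    #define PUD_SIZE  (1ULL << 30)   /* 1 GiB */

    /* Hypothetical: report the stride of the mapping covering 'off'. */
    static unsigned long long mapping_stride_at(unsigned long long off)
    {
        if ((off & (PUD_SIZE - 1)) == 0)
            return PUD_SIZE;    /* covered by a 1 GiB PUD mapping */
        if ((off & (PMD_SIZE - 1)) == 0)
            return PMD_SIZE;    /* covered by a 2 MiB PMD mapping */
        return PAGE_SIZE;       /* fall back to a 4 KiB PTE */
    }

    int main(void)
    {
        unsigned long long bar_size = 32ULL << 30;  /* 32 GiB BAR example */
        unsigned long long off, lookups = 0;

        /* One lookup + DMA-map operation per mapping, not per 4 KiB page. */
        for (off = 0; off < bar_size; off += mapping_stride_at(off))
            lookups++;

        printf("stride-based lookups: %llu (vs. %llu per-page lookups)\n",
               lookups, bar_size / PAGE_SIZE);
        return 0;
    }

When the whole range is covered by PUD mappings, the loop above performs
32 stride-sized operations instead of 8M per-page ones, which is the
reduction described in the Impact section.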

[ Where problems could occur ]

I do not expect any regressions. The only callers of ABIs changed by
this series are also adjusted within this series.

[ Additional Context ]

[0]: https://lore.kernel.org/all/20250205231728.2527186-1-alex.william...@redhat.com/
[1]: https://lore.kernel.org/all/cahta-uyp07fgm6t1ozqkqadsa5jrzo0reneyzgqzub4mdrr...@mail.gmail.com/
[2]: https://bugs.launchpad.net/bugs/2097389
[3]: https://bugs.launchpad.net/bugs/2101903

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Mitchell Augustin (mitchellaugustin)
         Status: In Progress

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
       Status: New => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2111861

Title:
  VM boots slowly with large-BAR GPU Passthrough (Root Cause Fix SRU)

Status in linux package in Ubuntu:
  In Progress
