Public bug reported:

SRU Justification:

[ Impact ]

Due to an inefficiency in the way older host kernels manage pfnmaps for
guest VM memory ranges [1], guests with large-BAR GPUs passed through
take a very long time (multiple minutes) to initialize when the MMIO
window advertised by OVMF is sufficiently sized for the passed-through
BARs (i.e., the correct OVMF behavior).

We have already integrated a partial efficiency improvement [2], which
is transparent to the user in 6.8+ kernels, as well as an OVMF-based
approach that lets the user force Jammy-like, faster boot speeds via
fw_cfg [3]. The patch series outlined in this report, however, is the
full fix for the underlying cause of the issue on kernels that have
support for huge pfnmaps.

With this series [0] applied to both the host and guest of an impacted
system, BAR initialization times are reduced substantially. In the
commonly achieved optimal case, this results in a reduction of pfn
lookups by a factor of 256k. For a local test system, an overhead of
~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
page-sized operations reduced to 32 pud-sized operations).
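
The figures above can be checked with a short calculation. This is only
a sketch: it assumes 4 KiB base pages and 1 GiB pud-level mappings (the
usual x86-64 values, not stated in this report) and treats "32GB" as
32 GiB. The helper name lookups() is made up for illustration.

```python
# Assumptions (not from the report): 4 KiB base pages, 1 GiB pud mappings.
KIB = 1024
GIB = 1024 ** 3

BASE_PAGE_SIZE = 4 * KIB
PUD_SIZE = 1 * GIB
BAR_SIZE = 32 * GIB


def lookups(region_size, stride):
    """Number of pfn lookups needed to cover region_size at a given stride."""
    return -(-region_size // stride)  # ceiling division


page_ops = lookups(BAR_SIZE, BASE_PAGE_SIZE)  # one lookup per base page
pud_ops = lookups(BAR_SIZE, PUD_SIZE)         # one lookup per pud mapping
reduction = page_ops // pud_ops

print(f"page-sized operations: {page_ops}")
print(f"pud-sized operations:  {pud_ops}")
print(f"reduction factor:      {reduction}")
```

Under these assumptions the counts come out to 8M page-sized operations,
32 pud-sized operations, and a 256k reduction factor, matching the
numbers quoted above.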

[ Test Plan ]

On a machine with GPUs with sufficiently large BARs:
1. Create a virtual machine with 4 GPUs passed through and CPU
   host-passthrough enabled. (We typically use DGX H100 or A100.)
2. Observe that, on an unaltered 6.14 kernel, the VM boot time exceeds
   5 minutes.
3. After applying this series to both the host and guest kernels, boot
   the guest and observe that the VM boot time is under 30 seconds, with
   the BAR initialization steps occurring significantly faster in dmesg
   output.

[ Fix ]

This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12. When pfnmaps are inserted using pud and
pmd mappings, a later lookup can use the page mask of the mapping level
to advance by the size of that mapping, rather than by one base page at
a time.
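
The stride idea can be illustrated with a small user-space model. This
is a sketch only, not the kernel code from the series: count_lookups()
and the lookup_level_size callback are hypothetical stand-ins for a pfn
lookup that also reports the size of the mapping covering an address,
and the page, pmd, and pud sizes are assumed x86-64 values.

```python
# Assumed x86-64 mapping sizes (not taken from the report).
PAGE_SIZE = 1 << 12   # 4 KiB base page
PMD_SIZE = 1 << 21    # 2 MiB pmd-level mapping
PUD_SIZE = 1 << 30    # 1 GiB pud-level mapping


def count_lookups(start, size, lookup_level_size):
    """Walk [start, start + size), performing one lookup per mapping.

    lookup_level_size(addr) is a hypothetical callback returning the size
    of the mapping that covers addr (PAGE_SIZE, PMD_SIZE, or PUD_SIZE).
    """
    addr = start
    end = start + size
    n = 0
    while addr < end:
        level_size = lookup_level_size(addr)
        page_mask = ~(level_size - 1)
        # Jump to the end of the mapping that contains addr, so the next
        # lookup lands in the following mapping instead of the next page.
        next_addr = (addr & page_mask) + level_size
        addr = min(next_addr, end)
        n += 1
    return n


# A 32 GiB range backed entirely by pud-level mappings.
print(count_lookups(0, 32 * PUD_SIZE, lambda addr: PUD_SIZE))
# A 4 MiB range backed only by base pages.
print(count_lookups(0, 4 << 20, lambda addr: PAGE_SIZE))
```

Deriving the stride from the page mask, instead of always advancing by
PAGE_SIZE, is what makes the number of lookups scale with the number of
mappings rather than the number of base pages. The masking also handles
a start address that falls in the middle of a large mapping.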

[ Where problems could occur ]

I do not expect any regressions. The only callers of ABIs changed by
this series are also adjusted within this series.

[ Additional Context ]

[0]: https://lore.kernel.org/all/[email protected]/
[1]: https://lore.kernel.org/all/cahta-uyp07fgm6t1ozqkqadsa5jrzo0reneyzgqzub4mdrr...@mail.gmail.com/
[2]: https://bugs.launchpad.net/bugs/2097389
[3]: https://bugs.launchpad.net/bugs/2101903

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Mitchell Augustin (mitchellaugustin)
         Status: In Progress

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
       Status: New => In Progress

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2111861

Title:
  VM boots slowly with large-BAR GPU Passthrough (Root Cause Fix SRU)
