Public bug reported:

[Impact]
During startup on one of Google Compute Engine's C4A machines, the gVNIC will fail to initialize:

[ 1.071899] gvnic 0000:00:00.0: enabling device (0010 -> 0012)
[ 1.076631] ACPI: \_SB_.PCI0.GSI2: Enabled at IRQ 37
[ 1.078075] nvme nvme0: pci function 0000:00:02.0
[ 1.093687] nvme nvme0: 4/0/0 default/read/poll queues
[ 1.097563] nvme0n1: p1 p15
[ 3.886472] gvnic 0000:00:00.0: AQ commands timed out, need to reset AQ
[ 3.888151] gvnic 0000:00:00.0: Could not get device information: err=-131
[ 3.891458] gvnic: probe of 0000:00:00.0 failed with error -131

Because this is a cloud instance, network failure means the instance is unusable.

[Fix]
A patchset to make the GVE driver work on both 64k page size and 4k page size kernels was applied in Linux 6.8, so Noble and later kernels do not have this problem. Backporting the patchset to 5.15 appears to fix the issue, as I was able to boot and connect to the machine using the patched kernel.

Patchset link: https://lore.kernel.org/all/20231128002648.320892-1-jfra...@google.com/

Hashes:
955f4d3bf0a45 ("gve: Perform adminq allocations through a dma_pool.")
8ae980d24195f ("gve: Deprecate adminq_pfn for pci revision 0x1.")
ce260cb114bbf ("gve: Remove obsolete checks that rely on page size.")
513072fb4bf81 ("gve: Add page size register to the register_page_list command.")
da7d4b42caf1b ("gve: Remove dependency on 4k page size.")

[Test plan]
Boot the 64k flavor of the patched kernel on a C4A Google Compute Engine instance, and verify that you can ssh to it.

[Regression potential]
Of the applied patches, "gve: Remove dependency on 4k page size." was the only one to have conflicts. It's possible that there are uses of the native PAGE_SIZE definition that aren't covered by the backport of the patch. This patchset is being backported without including other major GVE driver patchsets that had been applied before it in mainline.
Since the patches are isolated to the GVE driver, and since generic-64k previously didn't work on gVNIC instances at all, the possibility of failure is limited to configurations which were already not working, therefore not regressions.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Jammy)
     Importance: Undecided
         Status: New

** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/2109537

Title:
  Jammy generic-64k fails to initialize gVNIC devices

Status in linux package in Ubuntu:
  New
Status in linux source package in Jammy:
  New
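
For anyone reproducing the test plan, the two things worth checking on the booted instance are the kernel page size and whether the gvnic probe failure from the log above shows up again. The following is only a rough sketch of such a check, not part of the patchset: the function names `find_probe_failures` and `page_size_kib` are made up for illustration, and the regular expression assumes the failure line keeps the exact wording quoted in this report.

```python
import errno
import os
import re

# Assumption: the failure line looks like the one quoted in this report, e.g.
#   "gvnic: probe of 0000:00:00.0 failed with error -131"
PROBE_FAILURE = re.compile(r"gvnic: probe of (\S+) failed with error -(\d+)")


def find_probe_failures(log_text):
    """Return (pci_address, errno_number, errno_name) for each failed gvnic probe."""
    failures = []
    for match in PROBE_FAILURE.finditer(log_text):
        code = int(match.group(2))
        # errno.errorcode maps numbers to symbolic names on the running platform;
        # fall back to "unknown" when the number has no name there.
        name = errno.errorcode.get(code, "unknown")
        failures.append((match.group(1), code, name))
    return failures


def page_size_kib():
    """Return the running kernel's page size in KiB (64 expected on generic-64k)."""
    return os.sysconf("SC_PAGE_SIZE") // 1024
```

Feeding the kernel log (for example the saved output of dmesg) to `find_probe_failures` should return an empty list on a patched kernel, and `page_size_kib()` should report 64 on the 64k flavor. On most Linux architectures errno 131 corresponds to ENOTRECOVERABLE, which is what the unpatched log shows.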