Public bug reported:

[Impact]

During startup of the Jammy generic-64k kernel on one of Google Compute
Engine's C4A machines, the gVNIC fails to initialize:
[    1.071899] gvnic 0000:00:00.0: enabling device (0010 -> 0012)
[    1.076631] ACPI: \_SB_.PCI0.GSI2: Enabled at IRQ 37
[    1.078075] nvme nvme0: pci function 0000:00:02.0
[    1.093687] nvme nvme0: 4/0/0 default/read/poll queues
[    1.097563]  nvme0n1: p1 p15
[    3.886472] gvnic 0000:00:00.0: AQ commands timed out, need to reset AQ
[    3.888151] gvnic 0000:00:00.0: Could not get device information: err=-131
[    3.891458] gvnic: probe of 0000:00:00.0 failed with error -131

Because this is a cloud instance, network failure means the instance is
unusable.
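
The failure signature can be checked mechanically in a saved kernel log.
The following is a minimal sketch, assuming a POSIX shell with grep; the
function name gvnic_probe_failed is hypothetical, and the pattern is taken
from the log lines above. On Linux, errno 131 is ENOTRECOVERABLE ("State
not recoverable"), which is consistent with the admin queue (AQ) timeout
logged just before the probe failure.

```shell
# Hypothetical helper: exits 0 if the kernel log on stdin contains the
# gVNIC probe failure shown in this report, non-zero otherwise.
gvnic_probe_failed() {
    grep -Eq 'gvnic: probe of [0-9a-f:.]+ failed with error -[0-9]+'
}

# Example use (reading the kernel log may require root):
#   dmesg | gvnic_probe_failed && echo "gVNIC probe failed"
```

Since the network is down when this happens, the log would most likely
have to come from the instance's serial console output.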

[Fix]

A patchset that makes the GVE driver work on both 64k and 4k page size
kernels was applied in Linux 6.8, so Noble and later kernels do not have
this problem. Backporting the patchset to 5.15 appears to fix the issue: I
was able to boot and connect to the machine using the patched kernel.
Patchset link:
https://lore.kernel.org/all/20231128002648.320892-1-jfra...@google.com/
Hashes:
955f4d3bf0a45 ("gve: Perform adminq allocations through a dma_pool.")
8ae980d24195f ("gve: Deprecate adminq_pfn for pci revision 0x1.")
ce260cb114bbf ("gve: Remove obsolete checks that rely on page size.")
513072fb4bf81 ("gve: Add page size register to the register_page_list command.")
da7d4b42caf1b ("gve: Remove dependency on 4k page size.")

[Test plan]

Boot the 64k flavor of the patched kernel on a C4A Google Compute Engine
instance, and verify that you can ssh to it.
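
Once ssh works, it may be worth confirming that the instance really
booted the 64k flavor by checking its page size. This is a minimal
sketch, assuming a POSIX shell with getconf; the function name
is_64k_pages and the <instance> placeholder are hypothetical.

```shell
# Hypothetical helper: exits 0 if the given page size in bytes is 64 KiB.
# With no argument, it checks the page size of the running kernel.
is_64k_pages() {
    page_size="${1:-$(getconf PAGESIZE)}"
    [ "$page_size" -eq 65536 ]
}

# Example use from a workstation, with <instance> as a placeholder:
#   is_64k_pages "$(ssh -o ConnectTimeout=10 <instance> getconf PAGESIZE)" \
#       && echo "64k kernel is up and reachable"
```

A successful run covers both halves of the test plan: the ssh connection
proves the gVNIC came up, and the page size proves the 64k kernel was the
one that booted.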

[Regression potential]

Of the applied patches, "gve: Remove dependency on 4k page size." was
the only one to have conflicts. It's possible that there are uses of the
native PAGE_SIZE definition that aren't covered by the backport of the
patch. This patchset is being backported without the other major GVE
driver patchsets that were applied before it in mainline.
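
Remaining uses of the native page size macros can be listed for manual
review. This is a minimal sketch, assuming a POSIX shell and a grep that
supports -r and -w, run from the top of the kernel tree; the function
name page_size_uses is hypothetical, and including PAGE_SHIFT in the
search is an assumption about what else might be worth reviewing.

```shell
# Hypothetical helper: lists file:line:text for every use of the native
# PAGE_SIZE or PAGE_SHIFT macros under the given directory. The -w flag
# matches whole words only, so driver-prefixed macros are not reported.
page_size_uses() {
    grep -rnw -e 'PAGE_SIZE' -e 'PAGE_SHIFT' -- "$1"
}

# Example use from the root of the kernel source tree:
#   page_size_uses drivers/net/ethernet/google/gve
```

Not every hit is a bug, since some uses of PAGE_SIZE are legitimately
tied to the host page size. The output is only a list of places to review.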

Since the patches are isolated to the GVE driver, and since generic-64k
previously didn't work on gVNIC instances at all, the possibility of
failure is limited to configurations that were already not working, so
any such failure would not be a regression.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: linux (Ubuntu Jammy)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2109537

Title:
  Jammy generic-64k fails to initialize gVNIC devices

Status in linux package in Ubuntu:
  New
Status in linux source package in Jammy:
  New
