[Bug 2158237] Re: nvidia-dkms-580-open attempts a DKMS build despite prebuilt linux-modules-nvidia-580-open packages being installed

Anthony Wong Sat, 27 Jun 2026 23:05:40 -0700

I investigated this on the affected DUT (CID 202505-36753, Dell Pro Max
18 Plus MB18250 / Kingda Ka 18, Intel Core Ultra 9 285HX).

The initial symptom was that nvidia-dkms-580-open failed during
DKMS/module build for 6.17.0-1025-oem, with GCC-13 reporting internal
compiler errors / segmentation faults.

The important finding is that this is neither a generic NVIDIA DKMS
source issue nor a generic 6.17 header issue. The build succeeds against
the same NVIDIA 580 source and the same 6.17 headers when the workload
is kept off a specific group of CPUs (16-23 on this DUT).

Key observations:

* A one-time boot into 6.17.0-1025-oem now succeeds after package/initrd
recovery. NVIDIA 580 loads under 6.17, so the previous 6.17 boot failure was
likely due to incomplete package/initrd state after the failed DKMS
configuration, not a runtime NVIDIA boot blocker.
* Building NVIDIA 580 against 6.17.0-1025-oem succeeds when constrained to CPUs
0-7.
* Package recovery also succeeded after offlining CPUs 8-23 and running dpkg
--configure -a.
* Replaying one previously failing translation unit by itself succeeded. This
argues against a deterministic single-source-file or single-header compiler bug.
* The failure point moved across different source/header locations between
runs. This also argues against one fixed source construct being the root cause.
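
The CPU offlining step mentioned above can be sketched as follows. This
is a minimal illustration, not the exact procedure used on the DUT: it
assumes the standard Linux sysfs hotplug interface
(/sys/devices/system/cpu/cpuN/online) and must run as root on a real
system. The helper names online_file and set_online are hypothetical.

```python
from pathlib import Path

# Standard sysfs location for per-CPU hotplug control on Linux.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def online_file(cpu, root=SYSFS_CPU):
    """Return the sysfs 'online' control file for one logical CPU."""
    return root / f"cpu{cpu}" / "online"


def set_online(cpus, online, root=SYSFS_CPU):
    """Write 1 (online) or 0 (offline) for each CPU in `cpus`.

    Needs root. The boot CPU (cpu0) often has no 'online' file and
    cannot be offlined, so it should not be included.
    """
    for cpu in cpus:
        online_file(cpu, root).write_text("1" if online else "0")
```

For the case in this report, set_online(range(8, 24), False) would take
CPUs 8-23 offline before running dpkg --configure -a, and
set_online(range(8, 24), True) would bring them back afterwards.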

CPU-affinity tests narrowed the failure:

CPU set      make   Result
CPUs 0-7     -j8    pass
CPUs 8-15    -j8    pass
CPUs 16-23   -j8    fail
CPUs 8-11    -j4    pass
CPUs 12-15   -j4    pass
CPUs 16-19   -j4    hang / DUT wedged
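
An affinity sweep like the one above could be automated roughly as
follows. This is a sketch under stated assumptions: the report does not
give the exact build command, so the command is supplied by the caller,
and run_pinned, sweep and cmd_for_jobs are hypothetical names. It relies
on os.sched_setaffinity(), which is Linux-only.

```python
import os
import subprocess


def run_pinned(cmd, cpus, timeout_s):
    """Run cmd restricted to the CPU set `cpus`; return 'pass', 'fail' or 'hang'."""
    def pin():
        # Executed in the child just before exec; the affinity mask is
        # inherited by every process the build spawns (make, cc1, ...).
        os.sched_setaffinity(0, cpus)

    try:
        proc = subprocess.run(cmd, preexec_fn=pin, timeout=timeout_s,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # Only the direct child is killed on timeout; stray build
        # children may need separate cleanup.
        return "hang"
    return "pass" if proc.returncode == 0 else "fail"


def sweep(cmd_for_jobs, cpu_sets, timeout_s):
    """Run the build once per CPU set, with the job count equal to the set size."""
    results = {}
    for cpus in cpu_sets:
        label = f"{min(cpus)}-{max(cpus)}"
        results[label] = run_pinned(cmd_for_jobs(len(cpus)), set(cpus), timeout_s)
    return results
```

A call such as sweep(lambda j: ["make", f"-j{j}"], [range(0, 8),
range(8, 16), range(16, 24)], 1800) would mirror the first three rows,
with "make" standing in for the real build invocation. The timeout only
catches a stuck build; if the whole DUT wedges, as in the CPUs 16-19
run, nothing is reported, so results are best logged to persistent
storage before each run.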

Earlier, while running 6.11.0-1022-oem and building for 6.17.0-1025-oem,
the failure produced kernel oopses while cc1 was running:

BUG: unable to handle page fault
RIP: post_alloc_hook+0x3b/0x150
note: cc1[...] exited with irqs disabled
CPU: 19
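
To check whether oopses from repeated runs keep landing on the same CPU,
the relevant fields can be tallied from saved kernel logs. The sketch
below is illustrative only: tally_oops is a hypothetical helper, and the
patterns assume the "CPU: N" and "RIP: [segment:]symbol+0x..." shapes
seen in the excerpt above.

```python
import re
from collections import Counter

# "CPU: 19" as printed in the oops header.
CPU_RE = re.compile(r"\bCPU:\s*(\d+)")
# "RIP: post_alloc_hook+0x3b/0x150", optionally with a "0010:" segment prefix.
RIP_RE = re.compile(r"\bRIP:\s*(?:[0-9a-fA-F]{4}:)?([A-Za-z_]\w*)\+0x")


def tally_oops(log_text):
    """Count how often each CPU number and each RIP symbol appears."""
    cpus = Counter(int(n) for n in CPU_RE.findall(log_text))
    symbols = Counter(RIP_RE.findall(log_text))
    return cpus, symbols
```

Fed the concatenated logs from several failing runs, a result dominated
by one CPU number would support the single-bad-core interpretation.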

Under 6.17.0-1025-oem, the bad CPU group still caused GCC/conftest
failures, but I did not observe the same post_alloc_hook() oops in that
run.

The NVIDIA conftest result was also affected. On good CPUs the generated
result was:

#undef NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

On the bad CPU group, one failed build generated:

#define NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

That wrong result later caused a secondary compile error about
drm_gem_object_put_unlocked(). The bad CPU region can therefore silently
produce incorrect compile/conftest results, not only hard GCC crashes.
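
Silent conftest corruption of this kind can be detected by comparing the
generated macro results from a good-CPU build with those from a
suspect-CPU build. A minimal sketch, assuming the generated headers have
been saved as text from each run; parse_macros and differing_macros are
hypothetical helpers, and only plain #define / #undef lines are handled.

```python
import re

# Matches "#define NAME" and "#undef NAME" at the start of a line.
MACRO_RE = re.compile(r"^\s*#\s*(define|undef)\s+(\w+)", re.MULTILINE)


def parse_macros(header_text):
    """Map each macro name to True (#define) or False (#undef)."""
    return {name: kind == "define" for kind, name in MACRO_RE.findall(header_text)}


def differing_macros(good_text, bad_text):
    """Return the sorted macro names whose state differs between two runs."""
    good, bad = parse_macros(good_text), parse_macros(bad_text)
    return sorted(n for n in good.keys() | bad.keys() if good.get(n) != bad.get(n))
```

Because both builds use the same driver source and kernel headers, any
name this returns points at a nondeterministic conftest result rather
than a real API difference.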

Current interpretation:

This bug is best treated as a platform/kernel/firmware/hardware issue
exposed by the NVIDIA DKMS build workload, not as an
nvidia-graphics-drivers-580 source bug. The strongest suspect on the
affected DUT is CPU cluster 16-19, especially CPU 19.

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2158237

Title:
nvidia-dkms-580-open attempts a DKMS build despite prebuilt
linux-modules-nvidia-580-open packages being installed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-580/+bug/2158237/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
