I investigated this on the affected DUT (CID 202505-36753, Dell Pro Max
18 Plus MB18250 / Kingda Ka 18, Intel Core Ultra 9 285HX).

The initial symptom was that nvidia-dkms-580-open failed during the
DKMS module build for 6.17.0-1025-oem, with GCC-13 reporting internal
compiler errors and segmentation faults.

The important finding is that this is not a generic NVIDIA DKMS source
issue, and not a generic 6.17 header issue. The build succeeds against
the same NVIDIA 580 source and the same 6.17 headers when the workload
is kept away from a specific CPU cluster.

Key observations:

* A one-time boot into 6.17.0-1025-oem now succeeds after package/initrd
  recovery. NVIDIA 580 loads under 6.17, so the previous 6.17 boot failure
  was likely due to incomplete package/initrd state after the failed DKMS
  configuration, not a runtime NVIDIA boot blocker.
* Building NVIDIA 580 against 6.17.0-1025-oem succeeds when constrained to
  CPUs 0-7.
* Package recovery also succeeded after offlining CPUs 8-23 and running
  dpkg --configure -a.
* Replaying one previously failing translation unit by itself succeeded.
  This argues against a deterministic single-source-file or single-header
  compiler bug.
* The failure point moved across different source/header locations between
  runs. This also argues against one fixed source construct being the root
  cause.
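
For anyone repeating the package recovery step, the following is a minimal
Python sketch of offlining a CPU range through the standard Linux sysfs
hotplug files (/sys/devices/system/cpu/cpuN/online). The helper names
parse_cpu_list and set_cpus_online are hypothetical, and the exact method
used on the DUT is not recorded here. The sketch defaults to a dry run
that only lists the writes it would perform; the real writes need root.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # standard Linux sysfs location


def parse_cpu_list(spec: str) -> list[int]:
    """Expand a kernel-style CPU list such as "8-23" or "0-3,8" into integers."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def set_cpus_online(spec: str, online: bool, dry_run: bool = True) -> list[str]:
    """Write 1 or 0 to each cpuN/online file and return the planned actions."""
    actions = []
    value = "1" if online else "0"
    for cpu in parse_cpu_list(spec):
        target = SYSFS_CPU / f"cpu{cpu}" / "online"
        actions.append(f"{value} > {target}")
        if not dry_run:
            target.write_text(value)  # requires root privileges
    return actions


if __name__ == "__main__":
    # Dry run for the range mentioned above; nothing is written.
    for line in set_cpus_online("8-23", online=False):
        print(line)
```

After the dry run looks right, calling set_cpus_online("8-23", online=False,
dry_run=False) as root would perform the writes, and online=True reverses
them.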

CPU-affinity tests narrowed the failure:

  CPUs 0-7    with make -j8: pass
  CPUs 8-15   with make -j8: pass
  CPUs 16-23  with make -j8: fail
  CPUs 8-11   with make -j4: pass
  CPUs 12-15  with make -j4: pass
  CPUs 16-19  with make -j4: hang / DUT wedged
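
The matrix above can be reproduced with a small harness. This is a minimal
Python sketch, assuming a Linux host: it pins a child process to a CPU set
with os.sched_setaffinity() before exec and classifies the run as pass,
fail, or hang. The names run_pinned and run_matrix, the build command
passed in, and the timeout value are placeholders, not the commands
actually used on the DUT.

```python
import os
import subprocess


def run_pinned(cmd, cpus, timeout=None):
    """Run cmd restricted to the given CPUs; return "pass", "fail", or "hang"."""

    def pin():
        # Runs in the child just before exec, so processes it spawns
        # (for example the compiler jobs started by make) inherit the mask.
        os.sched_setaffinity(0, cpus)

    try:
        result = subprocess.run(cmd, preexec_fn=pin, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def run_matrix(build_cmd, timeout=1800):
    """Repeat build_cmd over the CPU groups and job counts from the table."""
    matrix = [
        (range(0, 8), 8), (range(8, 16), 8), (range(16, 24), 8),
        (range(8, 12), 4), (range(12, 16), 4), (range(16, 20), 4),
    ]
    results = []
    for cpus, jobs in matrix:
        verdict = run_pinned(build_cmd + [f"-j{jobs}"], set(cpus), timeout)
        results.append((f"{cpus.start}-{cpus.stop - 1}", jobs, verdict))
    return results
```

Calling run_matrix(["make"]) from the module build directory would yield
one (CPU range, job count, verdict) tuple per row. The timeout only catches
a build that stalls; if the whole machine wedges, as in the 16-19 case, the
harness cannot report anything, so results should be logged to persistent
storage or a remote host after each row.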

Earlier, while running 6.11.0-1022-oem and building for 6.17.0-1025-oem,
the failure produced kernel oopses while cc1 was running:

  BUG: unable to handle page fault
  RIP: post_alloc_hook+0x3b/0x150
  note: cc1[...] exited with irqs disabled
  CPU: 19

Under 6.17.0-1025-oem, the bad CPU group still caused GCC/conftest
failures, but I did not observe the same post_alloc_hook() oops in that
run.

The NVIDIA conftest result was also affected. On good CPUs the generated
result was:

  #undef NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

On the bad CPU group, one failed build generated:

  #define NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

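That later caused a secondary compile error about
drm_gem_object_put_unlocked(): with the macro defined, the driver takes
the code path that calls a function the 6.17 headers do not provide, as
the #undef result on good CPUs shows. This indicates the bad CPU region
can produce unreliable compile/conftest behavior, not only a hard GCC
crash.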
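
To check for other silently flipped conftest results, the generated headers
from a good-CPU build and a suspect build can be compared macro by macro.
This is a minimal Python sketch that works on the header text; the function
names are hypothetical, and where the generated headers live in the build
tree is left to the reader.

```python
import re

# Matches lines such as "#define NAME" or "#undef NAME".
MACRO_LINE = re.compile(r"^\s*#\s*(define|undef)\s+(\w+)", re.MULTILINE)


def macro_states(header_text):
    """Map each macro name to "define" or "undef" as found in the header text."""
    return {name: kind for kind, name in MACRO_LINE.findall(header_text)}


def conftest_differences(good_text, suspect_text):
    """Return {macro: (good_state, suspect_state)} for macros whose state differs."""
    good, suspect = macro_states(good_text), macro_states(suspect_text)
    return {
        name: (good.get(name), suspect.get(name))
        for name in sorted(good.keys() | suspect.keys())
        if good.get(name) != suspect.get(name)
    }


if __name__ == "__main__":
    good = "#undef NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT\n"
    suspect = "#define NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT\n"
    print(conftest_differences(good, suspect))
```

An empty result means the two builds agreed on every conftest macro; any
entry is a candidate for the kind of secondary compile error described
above. A macro missing from one side shows up with None for that side.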

Current interpretation:

This bug is best treated as a platform/kernel/firmware/hardware issue
exposed by the NVIDIA DKMS build workload, not as an
nvidia-graphics-drivers-580 source bug. The strongest suspect on the
affected DUT is CPU cluster 16-19, especially CPU 19.

https://bugs.launchpad.net/bugs/2158237

Title:
  nvidia-dkms-580-open attempts a DKMS build despite prebuilt
  linux-modules-nvidia-580-open packages being installed