I investigated this on the affected DUT (CID 202505-36753, Dell Pro Max 18 Plus MB18250 / Kingda Ka 18, Intel Core Ultra 9 285HX).
The initial symptom was that nvidia-dkms-580-open failed during DKMS/module build for 6.17.0-1025-oem, with GCC-13 reporting internal compiler errors / segmentation faults.

The important finding is that this is not a generic NVIDIA DKMS source issue, and not a generic 6.17 header issue. The build succeeds against the same NVIDIA 580 source and the same 6.17 headers when the workload is kept away from a specific CPU cluster.

Key observations:

* A one-time boot into 6.17.0-1025-oem now succeeds after package/initrd recovery. NVIDIA 580 loads under 6.17, so the previous 6.17 boot failure was likely due to incomplete package/initrd state after the failed DKMS configuration, not a runtime NVIDIA boot blocker.
* Building NVIDIA 580 against 6.17.0-1025-oem succeeds when constrained to CPUs 0-7.
* Package recovery also succeeded after offlining CPUs 8-23 and running dpkg --configure -a.
* Replaying one previously failing translation unit by itself succeeded. This argues against a deterministic single-source-file or single-header compiler bug.
* The failure point moved across different source/header locations between runs. This also argues against one fixed source construct being the root cause.

CPU-affinity tests narrowed the failure:

  CPUs 0-7   with make -j8: pass
  CPUs 8-15  with make -j8: pass
  CPUs 16-23 with make -j8: fail
  CPUs 8-11  with make -j4: pass
  CPUs 12-15 with make -j4: pass
  CPUs 16-19 with make -j4: hang / DUT wedged

Earlier, while running 6.11.0-1022-oem and building for 6.17.0-1025-oem, the failure produced kernel oopses while cc1 was running:

  BUG: unable to handle page fault
  RIP: post_alloc_hook+0x3b/0x150
  note: cc1[...] exited with irqs disabled
  CPU: 19

Under 6.17.0-1025-oem, the bad CPU group still caused GCC/conftest failures, but I did not observe the same post_alloc_hook() oops in that run.

The NVIDIA conftest result was also affected.
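
For anyone who wants to repeat the CPU-affinity tests above, here is a minimal sketch of how such a sweep could be scripted. It is not the exact procedure used on the DUT. The names run_pinned, sweep and build_cmd are hypothetical, and the real build command is not given in this report, so the caller has to supply it. The CPU groups and -j values are the ones listed above.

```python
import os
import subprocess

# CPU groups and job counts taken from the affinity tests listed above.
CPU_GROUPS = [
    (range(0, 8), 8),
    (range(8, 16), 8),
    (range(16, 24), 8),
    (range(8, 12), 4),
    (range(12, 16), 4),
    (range(16, 20), 4),
]


def run_pinned(cpus, cmd, timeout=None):
    """Run cmd restricted to the given CPUs and return its exit code.

    Returns None if the command did not finish within timeout seconds.
    """
    cpu_set = set(cpus)

    def pin():
        # Runs in the child before exec, so every process the build
        # spawns (make, gcc, cc1) inherits the same affinity mask.
        os.sched_setaffinity(0, cpu_set)

    try:
        return subprocess.run(cmd, preexec_fn=pin, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None


def sweep(build_cmd, timeout=3600):
    """Run build_cmd once per CPU group and record pass/fail/timeout.

    build_cmd is a hypothetical placeholder for the real build command,
    given as an argument list; "-jN" is appended for each group. The
    build tree should be cleaned between runs, which is not done here.
    """
    results = {}
    for cpus, jobs in CPU_GROUPS:
        rc = run_pinned(cpus, build_cmd + [f"-j{jobs}"], timeout)
        label = f"{cpus.start}-{cpus.stop - 1} -j{jobs}"
        if rc is None:
            results[label] = "timeout"
        else:
            results[label] = "pass" if rc == 0 else "fail"
    return results
```

The timeout only covers a build that stalls while the machine is still responsive. It does not help when the DUT itself wedges, as it did for CPUs 16-19, so results should be logged to persistent storage or over the network before moving on to a suspect group.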
On good CPUs the generated result was:

  #undef NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

On the bad CPU group, one failed build generated:

  #define NV_DRM_GEM_OBJECT_PUT_UNLOCK_PRESENT

That later caused a secondary compile error about drm_gem_object_put_unlocked(). This indicates the bad CPU region can produce unreliable compile/conftest behavior, not only a hard GCC crash.

Current interpretation:

This bug is best treated as a platform/kernel/firmware/hardware issue exposed by the NVIDIA DKMS build workload, not as an nvidia-graphics-drivers-580 source bug. The strongest suspect on the affected DUT is CPU cluster 16-19, especially CPU 19.

-- 
https://bugs.launchpad.net/bugs/2158237

Title:
  nvidia-dkms-580-open attempts a DKMS build despite prebuilt linux-modules-nvidia-580-open packages being installed
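
The conftest discrepancy described above (the same macro coming out as #undef on good CPUs and #define on the bad group) can be checked mechanically by comparing the generated conftest output from two builds. The following is a minimal sketch, not part of the NVIDIA build system. The function names macro_states and diff_conftest are hypothetical, and it assumes the conftest output is available as plain text containing one #define or #undef line per macro.

```python
import re

# Matches "#define NAME" or "#undef NAME" at the start of a line.
MACRO_RE = re.compile(r"^\s*#\s*(define|undef)\s+(\w+)", re.MULTILINE)


def macro_states(header_text):
    """Map each macro name to 'define' or 'undef' (last occurrence wins)."""
    return {m.group(2): m.group(1) for m in MACRO_RE.finditer(header_text)}


def diff_conftest(good_text, suspect_text):
    """Return macros whose state differs, as name -> (good, suspect).

    A state of None means the macro does not appear in that text at all.
    """
    good = macro_states(good_text)
    suspect = macro_states(suspect_text)
    return {
        name: (good.get(name), suspect.get(name))
        for name in sorted(good.keys() | suspect.keys())
        if good.get(name) != suspect.get(name)
    }
```

Any non-empty result between a build pinned to CPUs 0-7 and a build on the suspect group, for the same source and headers, would be another instance of the miscomputed conftest result reported here.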
