Just FYI: we hit a similar issue on the OEM projects, and it can be reproduced even with the 6.17 kernel. The root cause could be the virtual monitor dongle: after we swapped it for a real monitor, we could no longer reproduce the issue.
-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860

Title:
  lvl 5 pagetable system hang

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe-6.8 package in Ubuntu:
  New

Bug description:
  A hang occurs with a possible kernel BUG at arch/x86/mm/init_64.c:154
  during the memmap_init_zone_device initialization call in the AMDGPU
  init sequence.

  The log below shows the expected good result after the [drm] JPEG
  decode line: memmap_init_zone_device should execute, then amdgpum HMM
  registration. In the failing case, this is the point where the kernel
  BUG happens.

  =========================
  Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized successfully.
  Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device initialised 16777216 pages in 136ms
  Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB device memory
  Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 3989536 bytes on gart
  Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
  Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table created for GPU
  Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node [0x740f:0x1002]
  Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 1002:740f
  ====================

  The issue is a timing-related race condition when setting up the CPU
  page tables during the AMDGPU driver initialization. The potential
  issue could fall under Linux memory management for this 5-level page
  table error.

  The issue occurs during a server reboot stress test. The server
  environment should have at least 1 x AMD MI210 GPU with the amdgpu
  driver installed and enabled. Use ipmitool to drive a chassis cold
  boot in a loop with the loop count set to 1000. We are able to
  reliably reproduce this issue beyond 500 boot cycles.
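
For anyone trying to reproduce this, the reboot stress described above could be scripted roughly as follows. This is only a sketch under assumptions, not the reporter's actual script: the BMC host and credentials, the wait times, and the `fetch_log` callback (some way of getting the kernel log for the boot, e.g. a serial console capture) are all hypothetical, and the cold boot is assumed to be a `chassis power off` followed by `chassis power on` over the `lanplus` interface. The two marker strings are taken from the log lines in the description.

```python
import subprocess
import time

# Marker strings taken from the bug description; matching on them is an assumption.
BUG_MARKER = "kernel BUG at arch/x86/mm/init_64.c"
GOOD_MARKER = "HMM registered"


def ipmi_command(host, user, password, *args):
    """Build an ipmitool command line for a remote BMC (lanplus is assumed)."""
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, *args]


def classify_boot(log_text):
    """Classify one boot's kernel log as 'bug', 'good', or 'unknown'."""
    if BUG_MARKER in log_text:
        return "bug"
    if GOOD_MARKER in log_text:
        return "good"
    return "unknown"


def cold_boot_loop(host, user, password, cycles=1000, off_wait=30,
                   boot_wait=300, run=subprocess.run, sleep=time.sleep,
                   fetch_log=None):
    """Cold-boot the server up to `cycles` times.

    Returns the 1-based cycle in which the BUG marker was seen, or None.
    `fetch_log` is a hypothetical callback returning the kernel log text
    for the boot that just happened; without it, no log check is done.
    """
    for cycle in range(1, cycles + 1):
        run(ipmi_command(host, user, password, "chassis", "power", "off"),
            check=True)
        sleep(off_wait)   # let the chassis fully power down (assumed value)
        run(ipmi_command(host, user, password, "chassis", "power", "on"),
            check=True)
        sleep(boot_wait)  # give the OS time to boot and load amdgpu (assumed value)
        if fetch_log is not None and classify_boot(fetch_log()) == "bug":
            return cycle
    return None


# Example call (hypothetical BMC address and credentials):
# cold_boot_loop("10.0.0.5", "admin", "secret", cycles=1000)
```

The `run` and `sleep` parameters exist only so the loop can be dry-run without touching real hardware; on a real system the defaults call ipmitool directly.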

