Just FYI.
We hit a similar issue on the OEM projects, and it can be reproduced
even with the 6.17 kernel.
The root cause could be the virtual monitor dongle: after we swapped it for a
real monitor, we could no longer reproduce the issue.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860

Title:
  lvl 5 pagetable system hang

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe-6.8 package in Ubuntu:
  New

Bug description:
  A hang occurs with a possible kernel BUG at arch/x86/mm/init_64.c:154
  during the memmap_init_zone_device initialization call in the AMDGPU
  init sequence.

  For comparison, this is the expected good result after the [drm] JPEG
  decode line: memmap_init_zone_device should execute, followed by the
  amdgpum HMM line. In the failing case, the kernel BUG happens at this
  point.

  =========================

  Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized successfully.
  Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device initialised 16777216 pages in 136ms
  Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB device memory
  Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 3989536 bytes on gart
  Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
  Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table created for GPU
  Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node [0x740f:0x1002]
  Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 1002:740f
  ====================
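
  As a rough aid for triaging many boot logs, the good sequence above can
  be checked mechanically. This is only a sketch: the marker strings are
  copied from the log excerpt, while the function name and the idea of
  scanning saved kernel log text are assumptions, not part of the original
  report.

```python
# Markers expected, in this order, on a good boot (taken from the log excerpt).
EXPECTED_SEQUENCE = [
    "JPEG decode initialized successfully",
    "memmap_init_zone_device initialised",
    "HMM registered",
]


def first_missing_marker(log_text, markers=EXPECTED_SEQUENCE):
    """Return the first marker that does not appear in order, or None.

    Whitespace is collapsed first so that log lines wrapped by a mail
    client still match.
    """
    normalized = " ".join(log_text.split())
    position = 0
    for marker in markers:
        index = normalized.find(marker, position)
        if index == -1:
            return marker  # boot stopped before reaching this message
        position = index + len(marker)
    return None
```

  A return value of None means all three messages were seen in order; any
  other value names the first message that never showed up in that boot.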

  The issue is a timing-related race condition when setting up the CPU
  page tables during AMDGPU driver initialization. This 5-level page
  table error could fall under Linux memory management.


  The issue occurs during a server reboot stress test. The server should
  have at least one AMD MI210 GPU with the amdgpu driver installed and
  enabled. Use ipmitool to drive a chassis cold boot in a loop with the
  loop count set to 1000. We can reliably reproduce this issue beyond
  500 boot cycles.
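
  The reboot loop described above could be driven along these lines. This
  is a minimal sketch, not the script used in the original report: the
  BMC host, credentials, the lanplus interface, the fixed wait time, and
  the choice of "chassis power cycle" as the cold-boot command are all
  assumptions.

```python
import subprocess
import time


def build_power_cycle_command(bmc_host, user, password):
    """Build an ipmitool command that power-cycles the chassis via the BMC."""
    return [
        "ipmitool", "-I", "lanplus",  # assumed: BMC reachable over IPMI-over-LAN
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", "cycle",
    ]


def run_reboot_stress(bmc_host, user, password, loop_count=1000,
                      wait_seconds=600, run=subprocess.run, sleep=time.sleep):
    """Power-cycle the server loop_count times; return the completed cycle count.

    wait_seconds is a hypothetical fixed delay that gives the host time to
    boot before the next cycle. run and sleep are injectable for dry runs.
    """
    completed = 0
    for _ in range(loop_count):
        run(build_power_cycle_command(bmc_host, user, password), check=True)
        sleep(wait_seconds)
        completed += 1
    return completed
```

  This loop does not detect the hang by itself. A real harness would also
  need to check after each cycle that the host came back up, for example
  by inspecting the kernel log of that boot.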

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2096860/+subscriptions

