## Root Cause Analysis: amdgpu MES SET_HW_RSRC_1 Timeout on Dell Inspiron 16 DC16255 (Phoenix4 GPU)

### Summary

The boot failure is caused by the AMD MES (Micro Engine Scheduler)
timing out (-110 / ETIMEDOUT) while handling `SET_HW_RSRC_1` during
initialization. This is a kernel regression introduced by a backported
commit; the two upstream commits that fix it are missing from the
`linux-oem-6.17` tree.

---

### Failure Chain (from journal log)

```
amdgpu: MES FW version must be >= 0x7f to enable LR compute workaround.
amdgpu: MES failed to respond to msg=SET_HW_RSRC_1
[drm:mes_v11_0_hw_init] *ERROR* failed mes_v11_0_set_hw_resources_1, r=-110
amdgpu: hw_init of IP block <gfx_v11_0> failed -110
amdgpu: amdgpu_device_ip_init failed
amdgpu: Fatal error during GPU init
amdgpu: finishing device.
```

amdgpu tears itself down mid-init, displacing the simpledrm framebuffer
without establishing a replacement DRM device. Xorg then enters an
infinite respawn loop (`open /dev/dri/card0: No such file or directory`),
leaving the system unusable unless it is booted with `nomodeset`.
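
To check whether another machine shows the same failure chain, the journal can be scanned for the key lines in order. The following is a minimal sketch: the patterns are copied from the log excerpt above, while the function name and the in-order matching strategy are assumptions made for illustration and are not part of any amdgpu tooling.

```python
import re

# Signature lines taken from the journal excerpt above. The matching
# strategy is an illustrative assumption, not an official diagnostic.
FAILURE_PATTERNS = [
    re.compile(r"MES failed to respond to msg=SET_HW_RSRC_1"),
    re.compile(r"failed mes_v11_0_set_hw_resources_1, r=-110"),
    re.compile(r"hw_init of IP block <gfx_v11_0> failed -110"),
]


def has_mes_timeout_signature(journal_text: str) -> bool:
    """Return True if every signature line appears, in order, in the text."""
    position = 0
    for pattern in FAILURE_PATTERNS:
        match = pattern.search(journal_text, position)
        if match is None:
            return False
        # Continue searching after this match so that order is enforced.
        position = match.end()
    return True
```

One way to use it is to pass in the saved output of `journalctl -k -b -1` (kernel messages from the previous boot) after recovering the system with `nomodeset`.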

---

### Hardware

- **Machine**: Dell Inspiron 16 DC16255 (BIOS 0.4.10 06/18/2025)
- **GPU**: AMD Phoenix4 `[1002:1901]` subsystem `[1028:0d95]` rev 0xc8, GC 11.0.4 (DCN 3.1.4)
- **DMUB FW**: 0x08005300
- **MES FW (sched_version)**: < 0x7f (exact value not logged, but confirmed below threshold)
- **Kernel**: 6.17.0-1012-oem (`6.17.9`, built 2026-02-10)

---

### Root Cause: Two Compounding Bugs in linux-oem-6.17

#### Bug 1: MES LR compute workaround triggers SET_HW_RSRC_1 timeout

Upstream commit `1fb710793ce2` ("drm/amdgpu: Enable MES lr_compute_wa by
default") was backported into `linux-oem-6.17` as commit `8850944b17d3`.
This commit adds `enable_lr_compute_wa` to the `SET_HW_RESOURCES` MES
packet, conditional on MES FW version >= 0x7f.

On this machine, MES FW is below 0x7f — the kernel correctly skips the
LR compute WA bit and logs the warning message. **However, the
`SET_HW_RSRC_1` packet is still sent** to older MES firmware that does
not support it, causing the timeout.

The upstream fix is commit `6b0d812971370` ("drm/amd: Disable MES LR
compute W/A"), which completely removes the `enable_lr_compute_wa` logic
from `mes_v11_0.c`. The commit message states:

> "There are reports of instability on other products with newer GC
microcode versions, and I believe they're caused by this workaround. As
we don't need the workaround any more, remove it."

This commit carries `Cc: stable@vger.kernel.org` and is tagged as a fix
for the original workaround. **This commit is NOT present in
linux-oem-6.17.**

#### Bug 2: SET_HW_RSRC_1 version gate too low for GC 11.0.4

The `SET_HW_RSRC_1` call in `mes_v11_0_hw_init()` is gated on
`sched_version >= 0x50`. This machine's Phoenix4 (GC 11.0.4) has MES FW
at version 0x51, which passes the >= 0x50 check but does not correctly
support the `SET_HW_RSRC_1` packet, causing the ETIMEDOUT.

The upstream fix is commit `1478a34470bf` ("drm/amd: Set minimum version
for set_hw_resource_1 on gfx11 to 0x52"), which bumps the gate from
`>= 0x50` to `>= 0x52`. Its commit message states:

> "GC 11.0.4 had breakage at MES 0x51. Bump the requirement to 0x52
instead."

Reported upstream at:
https://gitlab.freedesktop.org/drm/amd/-/issues/4576

This commit also carries `Cc: stable@vger.kernel.org`. **This commit is
also NOT present in linux-oem-6.17.**
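
The effect of the gate change can be illustrated with a toy model. This is only a sketch of the comparison described above; it does not reproduce the driver's C code in `mes_v11_0.c`, and the function and constant names are hypothetical.

```python
OLD_MIN_SCHED_VERSION = 0x50  # gate described above for linux-oem-6.17
NEW_MIN_SCHED_VERSION = 0x52  # gate after upstream commit 1478a34470bf


def sends_set_hw_rsrc_1(sched_version: int, min_version: int) -> bool:
    """Toy model of the gate: send the packet only at or above min_version."""
    return sched_version >= min_version


# Compare both gates for firmware below, at, and above the problem version.
for label, gate in (("old", OLD_MIN_SCHED_VERSION), ("new", NEW_MIN_SCHED_VERSION)):
    for fw in (0x4F, 0x51, 0x52):
        sent = sends_set_hw_rsrc_1(fw, gate)
        print(f"{label} gate {gate:#x}: MES FW {fw:#x} -> send SET_HW_RSRC_1: {sent}")
```

With the old gate, firmware 0x51 still receives the packet; with the new gate it is skipped, while 0x52 and later are unaffected.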

---

### Fix Status

| Commit | Subject | In linux-oem-6.17? |
|--------|---------|---------------------|
| `1fb710793ce2` | drm/amdgpu: Enable MES lr_compute_wa by default | ✅ YES (as `8850944b17d3`); **THIS IS THE REGRESSION** |
| `6b0d812971370` | drm/amd: Disable MES LR compute W/A | ❌ **MISSING**, needs backport |
| `1478a34470bf` | drm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52 | ❌ **MISSING**, needs backport |

---

### Recommended Fix

Backport both commits to `linux-oem-6.17`:

1. `6b0d812971370c64b837a2db4275410f478272fe`: removes the MES LR compute workaround entirely (primary fix, `Cc: stable@vger.kernel.org`)
2. `1478a34470bf4755465d29b348b24a610bccc180`: bumps the SET_HW_RSRC_1 minimum version gate from 0x50 to 0x52 (secondary fix, `Cc: stable@vger.kernel.org`)

Either fix alone may be sufficient (the version gate bump at 0x52 would
skip the SET_HW_RSRC_1 call on firmware 0x51), but backporting both is
advisable as the LR compute WA removal also addresses broader
instability reports on other products.
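
Whether a given tree already carries the two fixes can be checked by searching commit messages for the upstream SHAs. The sketch below assumes that backports record the upstream SHA in the commit message (for example in a "cherry picked from commit" line); the helper names are hypothetical. It uses only `git log --grep`, and a hit means some commit message mentions the SHA, so a match should still be reviewed by hand.

```python
import subprocess

# Full upstream SHAs as listed in the recommendation above.
UPSTREAM_FIXES = {
    "6b0d812971370c64b837a2db4275410f478272fe": "drm/amd: Disable MES LR compute W/A",
    "1478a34470bf4755465d29b348b24a610bccc180": (
        "drm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52"
    ),
}


def build_grep_command(upstream_sha: str, rev: str = "HEAD") -> list[str]:
    """Build a `git log` call listing commits whose message cites the SHA."""
    return ["git", "log", "--oneline", f"--grep={upstream_sha[:12]}", rev]


def is_backported(upstream_sha: str, repo: str = ".", rev: str = "HEAD") -> bool:
    """Return True if any commit reachable from rev mentions the upstream SHA."""
    result = subprocess.run(
        build_grep_command(upstream_sha, rev),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def report(repo: str = ".") -> None:
    """Print one status line per upstream fix for the given checkout."""
    for sha, subject in UPSTREAM_FIXES.items():
        status = "present" if is_backported(sha, repo) else "MISSING"
        print(f"{sha[:12]}  {status:8}  {subject}")
```

Calling `report("/path/to/linux-oem-6.17")` from a checkout of the tree (the path is a placeholder) prints the status of each fix.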

---

### Additional Observation (from Lspci.txt)

PCI device `0000:03:00.0` (AMD GPU) shows `Interrupt: pin A routed to
IRQ 255`. IRQ 255 is an invalid/unassigned interrupt; this is a
secondary anomaly in the collected data. However, since the `nomodeset`
boot (where amdgpu is not loaded) works correctly, this is likely a
consequence of the failed amdgpu probe rather than a contributing cause.
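
For comparing collected `Lspci.txt` files across affected and unaffected boots, devices routed to IRQ 255 can be listed with a small parser. This is a sketch that assumes the verbose `lspci` layout (an unindented device header followed by indented detail lines) and the `Interrupt: pin A routed to IRQ 255` wording quoted above; the function name is hypothetical.

```python
import re

# Matches the interrupt line quoted above from verbose lspci output.
IRQ_LINE = re.compile(r"Interrupt: pin (?P<pin>[A-D]) routed to IRQ (?P<irq>\d+)")


def find_unassigned_irqs(lspci_text: str) -> list[str]:
    """Return the device header lines whose interrupt is routed to IRQ 255."""
    flagged = []
    current_device = None
    for line in lspci_text.splitlines():
        # Device headers start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            current_device = line.strip()
        match = IRQ_LINE.search(line)
        if match and int(match.group("irq")) == 255 and current_device:
            flagged.append(current_device)
    return flagged
```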

** Bug watch added: gitlab.freedesktop.org/drm/amd/-/issues #4576
   https://gitlab.freedesktop.org/drm/amd/-/issues/4576

** Tags added: amdgpu boot-regression gfx11 mes-timeout oem-priority
phoenix4

** Changed in: linux-oem-6.17 (Ubuntu)
   Importance: Undecided => High

** Changed in: linux-oem-6.17 (Ubuntu)
       Status: New => Confirmed

** Also affects: linux (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Questing)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Questing)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux-oem-6.17 (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Noble)
       Status: New => Invalid

** Changed in: linux (Ubuntu Questing)
       Status: New => In Progress

** Changed in: linux-oem-6.17 (Ubuntu Noble)
       Status: New => In Progress

** Changed in: linux-oem-6.17 (Ubuntu Questing)
       Status: New => Invalid

** Changed in: linux (Ubuntu Questing)
     Assignee: (unassigned) => AceLan Kao (acelankao)

** Changed in: linux-oem-6.17 (Ubuntu Noble)
     Assignee: (unassigned) => AceLan Kao (acelankao)

https://bugs.launchpad.net/bugs/2144522

Title:
  Dell Machines cannot boot into OS with 6.17.0-1012-oem
