Update: this is NOT a separate bug from the short freezes - it is the same
display-core failure, and it reproduces on the 6.19.10 mainline kernel that
several of us have been using as a workaround.

Today I hit a ~2 minute full freeze on 6.19.10 (not 7.0.x). From my
side:

- Counter-Strike 2 on monitor 1, YouTube on monitor 2.
- ~18:24 both monitors froze. After a few seconds monitor 1 recovered on its
  own; monitor 2 stayed completely frozen on a single YouTube frame.
- Closing Chrome did nothing - the frozen frame stayed on screen.
- I opened Settings -> Display and toggled the second monitor. Both screens
  went inactive, then both came back. Recovery at ~18:26.

journalctl -k for that window:

May 29 18:24:01 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:367:crtc-1] flip_done timed out
May 29 18:24:01 kernel: amdgpu 0000:03:00.0: amdgpu: [drm] *ERROR* [CRTC:367:crtc-1] hw_done or flip_done timed out
May 29 18:25:12 kernel: workqueue: dm_handle_vmin_vmax_update [amdgpu] hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND
May 29 18:25:32 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
May 29 18:25:32 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CRTC:367:crtc-1] commit wait timed out
May 29 18:25:43 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
May 29 18:25:43 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [CONNECTOR:387:DP-2] commit wait timed out
May 29 18:25:53 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
May 29 18:25:53 kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:148:plane-2] commit wait timed out
May 29 18:25:53 kernel: amdgpu 0000:03:00.0: [drm] REG_WAIT timeout 1us * 100 tries - dcn32_program_compbuf_size line:147
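
For anyone who wants to check their own journalctl -k output for the same
signatures, here is a minimal Python sketch. It is not part of any amdgpu
tooling; the category names and the helpers join_wrapped() and classify() are
my own, and the patterns are only derived from the messages in the excerpt
above, so they may need adjusting for other kernels.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above; the category names are
# hypothetical labels, not kernel terminology.
SIGNATURES = {
    "flip_done_timeout": re.compile(r"flip_done timed out"),
    "commit_wait_timeout": re.compile(r"commit wait timed out"),
    "workqueue_hog": re.compile(r"workqueue: (\S+) \[amdgpu\] hogged CPU"),
    "reg_wait_timeout": re.compile(r"REG_WAIT timeout .* - (\w+) line:(\d+)"),
}

# DRM object tags such as [CRTC:367:crtc-1] or [CONNECTOR:387:DP-2].
DRM_OBJECT = re.compile(r"\[(CRTC|CONNECTOR|PLANE):(\d+):([^\]]+)\]")

# A new entry starts with a syslog-style timestamp, e.g. "May 29 18:24:01 ".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")


def join_wrapped(text):
    """Re-join entries that a mail client wrapped onto a second line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line.rstrip())
        else:
            entries[-1] = entries[-1] + " " + line.strip()
    return entries


def classify(entries):
    """Count matched signatures and the DRM objects named in the entries."""
    sig_counts, obj_counts = Counter(), Counter()
    for entry in entries:
        for name, pattern in SIGNATURES.items():
            if pattern.search(entry):
                sig_counts[name] += 1
        for kind, obj_id, label in DRM_OBJECT.findall(entry):
            obj_counts[f"{kind}:{obj_id}:{label}"] += 1
    return sig_counts, obj_counts
```

Calling classify(join_wrapped(text)) on a saved log window returns two
counters: one per signature and one per CRTC / CONNECTOR / PLANE tag, which
makes it easy to see whether the errors cluster on a single pipe.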

Key points:

1. Same signatures as the 7.0.0 reports. The dm_handle_vmin_vmax_update
   workqueue hogging, the dcn32_program_compbuf_size REG_WAIT timeout, and the
   flip_done / commit wait timeouts are the same messages seen on 7.0.0-15. So
   6.19.10 is NOT free of this defect.

2. This is a display/KMS commit wedge, not a GPU hang. There is no ring
   timeout, no GPU reset, and no IP block message anywhere in the window. The
   GPU kept rendering (the game continued underneath) - only the
   presentation/atomic commit on one pipe got stuck.

3. The freeze was localized to crtc-1 / DP-2 (my second monitor). Only that
   CRTC appears in the errors; monitor 1's pipe recovered on its own, which
   matches what I saw. Closing Chrome changed nothing because the wedge is at
   the hardware-commit level, not tied to the app holding the buffer.

4. The timeouts at 18:25:32, 18:25:43, and 18:25:53 are my own modeset
   attempts from Settings, which also timed out because the pipe was still
   wedged. The disable -> re-enable finally forced a full modeset that reset
   the pipe and recovered it.

5. The REG_WAIT itself is short (1us * 100 tries = ~100us), so it is the tell,
   not the cause: dcn32_program_compbuf_size waited for a hardware ack that
   never arrived, so the pipe programming never completed, flip_done never
   signaled, and the atomic commit hung until the manual modeset. This same
   REG_WAIT failure path therefore exists on DC 3.2.359 (6.19.10), not only on
   DC 3.2.369 (7.0.0).
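
To make points 2 and 5 checkable, here is a rough Python sketch that (a)
computes the REG_WAIT polling budget from the message itself and (b) does a
crude triage of a log window into "GPU hang" versus "display commit wedge".
The function names are hypothetical, and the GPU-hang markers ("ring ...
timeout", "GPU reset") are my assumption about what amdgpu typically logs on a
real hang; the exact wording can differ between kernel versions.

```python
import re

# Matches e.g. "REG_WAIT timeout 1us * 100 tries - dcn32_program_compbuf_size line:147"
REG_WAIT = re.compile(r"REG_WAIT timeout (\d+)us \* (\d+) tries - (\w+) line:(\d+)")

# Assumption: substrings typically logged on a real GPU hang. Illustrative only.
GPU_HANG_MARKERS = (re.compile(r"ring \S+ timeout"), re.compile(r"GPU reset"))

# Display-side timeouts as they appear in the log excerpt above.
DISPLAY_MARKERS = (
    re.compile(r"flip_done timed out"),
    re.compile(r"commit wait timed out"),
)


def reg_wait_budget_us(entry):
    """Return (total polling budget in microseconds, function name), or None."""
    match = REG_WAIT.search(entry)
    if match is None:
        return None
    delay_us, tries, func, _line = match.groups()
    return int(delay_us) * int(tries), func


def triage(entries):
    """Crude classification of a log window; GPU-hang markers take priority."""
    if any(p.search(e) for e in entries for p in GPU_HANG_MARKERS):
        return "gpu-hang"
    if any(p.search(e) for e in entries for p in DISPLAY_MARKERS):
        return "display-commit-wedge"
    return "unknown"
```

On the REG_WAIT line from my log, reg_wait_budget_us() gives a 100 us budget
for dcn32_program_compbuf_size, which is why the wait itself cannot account
for a two-minute freeze. A window containing only flip_done / commit wait
timeouts and none of the hang markers comes out as a display commit wedge.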

Frequency / workaround caveat:

I have been running 6.19.10 as a workaround for 20 days now. In those 20
days this is the second time I have had a ~2 minute full freeze. The first
time followed the same pattern: both screens froze, monitor 1 recovered, and
monitor 2 stayed frozen for a few minutes, although that time it recovered
on its own without a manual modeset. Separately, the
"hogged CPU ... 19 times" counter is cumulative since boot, which shows the
short-freeze path is still active on 6.19.10 - I just rarely notice the
sub-second ones.

Conclusion:

6.19.10 reduces the frequency of the freezes but does not fix them. The same
dm_handle_vmin_vmax_update / dcn32_program_compbuf_size / flip_done failure
path is present on both 6.19.10 (DC 3.2.359) and 7.0.0-15 (DC 3.2.369). This
looks less like a clean "introduced in 7.0" regression and more like a latent
display-core defect that 7.0 made fire far more often. It is worth flagging to
the amdgpu display team that the dcn32_program_compbuf_size REG_WAIT timeout
and the resulting flip-completion wedge reproduce on 3.2.359 as well, so a fix
targeting only the 7.0 frequency change may not cover the underlying pipe
wedge.

For anyone using a 6.19.x kernel as a workaround: it is a mitigation, not a
fix - the same total freeze can still happen.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2150776

Title:
  Ubuntu 26.04 GNOME Wayland: random short display/presentation freezes
  on AMD RX 7900 XT while apps continue running

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2150776/+subscriptions

