[AMD Official Use Only - AMD Internal Distribution Only]

Hi Alex,

amdgpu_fence_driver_force_completion() works as expected, but it is called 
before the GPU reset.

In this failing case, the KIQ ring is still used during the GPU reset 
(e.g. for HDP flush in amdgpu_kiq_hdp_flush()). Each of those submissions emits 
a fence and increments sync_seq. The hardware queue is later cleared by the 
reset and never runs those commands, so it never updates the fence writeback 
location. After reset, the writeback memory still holds the last value written 
before reset while sync_seq has moved far ahead.

Before emitting a new fence, amdgpu_fence_emit_polling() waits until the oldest 
in-flight fence is done: it calls amdgpu_fence_wait_polling(ring, seq - 
ring->fence_drv.num_fences_mask, timeout), which busy waits until the writeback 
value is at least that sequence. Because the writeback value is still near the 
pre-reset value and sync_seq has grown, the required sequence (sync_seq - 
num_fences_mask) is much larger than the writeback value. The driver therefore 
waits for completion of fences that were lost in the reset and will never 
complete, the wait hits the timeout, and amdgpu_fence_emit_polling() returns 
-ETIMEDOUT. The driver then refuses to emit new KIQ fences, assuming the ring 
is full, and KIQ submissions effectively stall.
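
To make the arithmetic concrete, here is a rough user-space model of the 
sequence above. It is only a sketch, not the driver code: the sim_* names, the 
struct layout, the timeout budget and the idea that a live queue completes a 
fence immediately are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified model of one ring's polling-fence state. */
struct sim_fence_drv {
	uint32_t sync_seq;        /* last sequence number handed out by software */
	uint32_t writeback;       /* last sequence number the hardware wrote back */
	uint32_t num_fences_mask; /* fence slot count minus one */
};

/* Spin until writeback reaches wait_seq or the budget is used up.
 * Returns the remaining budget; 0 means the wait timed out.
 * The signed difference keeps the comparison correct across wraparound. */
static long sim_wait_polling(const struct sim_fence_drv *drv,
			     uint32_t wait_seq, long timeout)
{
	while ((int32_t)(wait_seq - drv->writeback) > 0 && timeout > 0)
		timeout--;
	return timeout;
}

/* Take a new sequence number, wait for the oldest slot, then "emit".
 * hw_runs == 0 models a submission that the reset throws away, so the
 * writeback value is never updated for it. */
static int sim_emit_polling(struct sim_fence_drv *drv, int hw_runs, long timeout)
{
	uint32_t seq = ++drv->sync_seq;

	if (sim_wait_polling(drv, seq - drv->num_fences_mask, timeout) < 1)
		return -ETIMEDOUT;
	if (hw_runs)
		drv->writeback = seq;
	return 0;
}

/* Emit `lost` fences that the hardware never runs, then try one more
 * emit on a working queue and return its result. */
static int sim_emit_after_lost(uint32_t start_seq, uint32_t mask, int lost)
{
	struct sim_fence_drv drv = { start_seq, start_seq, mask };
	int i;

	assert(((mask + 1) & mask) == 0); /* slot count is a power of two */
	for (i = 0; i < lost; i++)
		(void)sim_emit_polling(&drv, 0, 100);
	return sim_emit_polling(&drv, 1, 100);
}
```

With a mask of 7 in this model, up to six lost submissions are tolerated. From 
the seventh on, sync_seq - num_fences_mask is ahead of the stale writeback 
value and every later emit returns -ETIMEDOUT, even on a working queue.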

To fix this, the code now sets the fence writeback memory to sync_seq when 
re-initializing the KIQ after a reset, so the driver no longer waits for the 
lost fences. amdgpu_fence_emit_polling() can then emit new fences without 
timing out, and KIQ usage resumes after reset.
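
A small user-space model of that change is sketched below. The sim_* names and 
the struct are hypothetical and exist only for this illustration. The model 
also ignores endianness, whereas the patch stores the value through 
cpu_to_le32().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of the fence state after a reset. */
struct sim_fence_state {
	uint32_t sync_seq;        /* last sequence number emitted by software */
	uint32_t writeback;       /* value left in the fence writeback memory */
	uint32_t num_fences_mask; /* fence slot count minus one */
};

/* Nonzero if the next polling emit would have to wait: the sequence it
 * waits for is still ahead of the writeback value. */
static int sim_emit_would_block(const struct sim_fence_state *s)
{
	uint32_t next_seq = s->sync_seq + 1;
	uint32_t wait_seq = next_seq - s->num_fences_mask;

	return (int32_t)(wait_seq - s->writeback) > 0;
}

/* Idea of the fix: treat everything emitted so far as complete. */
static void sim_resync_after_reset(struct sim_fence_state *s)
{
	s->writeback = s->sync_seq;
}

/* Software emitted `lost` fences that the reset discarded; report whether
 * the next emit blocks, with or without the resync step. */
static int sim_blocked_after_reset(uint32_t pre_reset_seq, uint32_t lost,
				   uint32_t mask, int resync)
{
	struct sim_fence_state s = { pre_reset_seq + lost, pre_reset_seq, mask };

	assert(((mask + 1) & mask) == 0); /* slot count is a power of two */
	if (resync)
		sim_resync_after_reset(&s);
	return sim_emit_would_block(&s);
}
```

In this model the next emit blocks once enough fences were lost, and stops 
blocking as soon as the writeback value is set to sync_seq, including when 
sync_seq has wrapped around.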

Regards,
Chenglei
-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Friday, March 6, 2026 11:18 AM
To: Xie, Chenglei <[email protected]>
Cc: [email protected]; Chan, Hing Pong <[email protected]>; Luo, 
Zhigang <[email protected]>; Zhang, Hawking <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: Fix KIQ fence timeout after GPU reset on GFX 
v9.4.3

On Tue, Mar 3, 2026 at 11:29 AM Chenglei Xie <[email protected]> wrote:
>
> After GPU reset, the hardware queue is cleared and all pending fences
> are lost. However, the fence writeback memory remains stale from
> before reset, while software continues emitting fences and sync_seq
> keeps incrementing. This causes amdgpu_fence_emit_polling() to wait
> for fences that were lost during reset, resulting in -ETIMEDOUT errors.
>
> Fix this by updating the fence writeback memory to match sync_seq
> after GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the
> hardware's view of completed fences with software's view of emitted
> fences, preventing timeouts when waiting for fences that no longer exist.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index ad4d442e7345e..6b5fcdd987693 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
>                 gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
>                 soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
>                 mutex_unlock(&adev->srbm_mutex);
> +
> +               /* Update fence writeback memory to align with software state after reset.
> +                * After GPU reset, the hardware queue is cleared and all pending fences
> +                * are lost. The fence writeback memory may be stale from before reset. To prevent
> +                * waiting for lost fences, update writeback memory to match sync_seq.
> +                * This avoids waiting for lost fences and prevents timeouts.
> +                */

This doesn't make sense.  No other kiq_init_queue() function does this.  When 
the GPU is reset, amdgpu_fence_driver_force_completion() should get called for 
each ring.  That will set an error on the fence and update the fence sequence.  
Why is that not working?

Alex

> +                if (ring->fence_drv.cpu_addr)
> +                       *ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
>         } else {
>                 memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
>                 ((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
> --
> 2.34.1
>
