On Tue, Mar 3, 2026 at 11:29 AM Chenglei Xie <[email protected]> wrote:
>
> After GPU reset, the hardware queue is cleared and all pending fences
> are lost. However, the fence writeback memory remains stale from before
> reset, while software continues emitting fences and sync_seq keeps
> incrementing. This causes amdgpu_fence_emit_polling() to wait for
> fences that were lost during reset, resulting in -ETIMEDOUT errors.
>
> Fix this by updating the fence writeback memory to match sync_seq after
> GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the hardware's
> view of completed fences with software's view of emitted fences,
> preventing timeouts when waiting for fences that no longer exist.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> ---
> drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index ad4d442e7345e..6b5fcdd987693 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
> gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
> soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> mutex_unlock(&adev->srbm_mutex);
> +
> + /* Update fence writeback memory to align with software state after reset.
> + * After GPU reset, the hardware queue is cleared and all pending fences
> + * are lost. The fence writeback memory may be stale from before reset. To prevent
> + * waiting for lost fences, update writeback memory to match sync_seq.
> + * This avoids waiting for lost fences and prevents timeouts.
> + */

This doesn't make sense.  No other kiq_init_queue() function does
this.  When the GPU is reset, amdgpu_fence_driver_force_completion()
should get called for each ring.  That will set an error on the fence
and update the fence sequence.  Why is that not working?

Alex

> + if (ring->fence_drv.cpu_addr)
> + *ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
> } else {
> memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
> ((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
> --
> 2.34.1
>
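
For reference, the mismatch under discussion can be sketched as a toy
userspace model.  This is not kernel code: struct toy_fence_drv and every
toy_* function below are hypothetical stand-ins invented for illustration,
and toy_force_completion() models only the "update the fence sequence"
part of what amdgpu_fence_driver_force_completion() is described as doing
above, not the error marking.  The sketch assumes that a polling wait
compares the awaited sequence number against the writeback value.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of per-ring fence state (hypothetical, not the kernel struct). */
struct toy_fence_drv {
	uint32_t sync_seq;   /* last sequence number software emitted */
	uint32_t writeback;  /* last sequence number the "hardware" wrote back */
};

/* Software emits a fence: bump sync_seq and return the value to wait for. */
static uint32_t toy_fence_emit(struct toy_fence_drv *drv)
{
	return ++drv->sync_seq;
}

/* The "hardware" signals completion of everything up to seq. */
static void toy_hw_signal(struct toy_fence_drv *drv, uint32_t seq)
{
	drv->writeback = seq;
}

/* Assumed model of force completion: writeback catches up with sync_seq. */
static void toy_force_completion(struct toy_fence_drv *drv)
{
	drv->writeback = drv->sync_seq;
}

/*
 * Poll until writeback reaches wait_seq or the retry budget runs out.
 * The signed difference keeps the comparison correct across wraparound.
 * Returns 0 on success, -1 on timeout.
 */
static int toy_fence_wait_polling(const struct toy_fence_drv *drv,
				  uint32_t wait_seq, int budget)
{
	while ((int32_t)(wait_seq - drv->writeback) > 0) {
		if (budget-- <= 0)
			return -1;
	}
	return 0;
}

/* One fence completes, one is lost to a reset, then software waits on it. */
static int toy_reset_scenario(int force_completion)
{
	struct toy_fence_drv drv = { .sync_seq = 0, .writeback = 0 };
	uint32_t done = toy_fence_emit(&drv);  /* seq 1, completes normally */
	uint32_t lost;

	toy_hw_signal(&drv, done);
	lost = toy_fence_emit(&drv);           /* seq 2, never completes */

	/* Reset happens here: the queue is cleared, writeback stays stale. */
	assert(drv.writeback == 1 && drv.sync_seq == 2);

	if (force_completion)
		toy_force_completion(&drv);

	return toy_fence_wait_polling(&drv, lost, 1000);
}
```

In this model, toy_reset_scenario(0) times out because the writeback value
never reaches the lost sequence number, while toy_reset_scenario(1)
returns immediately.  If the force-completion path runs for the ring
during reset, the extra write in kiq_init_queue() would be redundant,
which is the point of the question above.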