On Tue, Mar 3, 2026 at 11:29 AM Chenglei Xie <[email protected]> wrote:
>
> After GPU reset, the hardware queue is cleared and all pending fences
> are lost. However, the fence writeback memory remains stale from before
> reset, while software continues emitting fences and sync_seq keeps
> incrementing. This causes amdgpu_fence_emit_polling() to wait for
> fences that were lost during reset, resulting in -ETIMEDOUT errors.
>
> Fix this by updating the fence writeback memory to match sync_seq after
> GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the hardware's
> view of completed fences with software's view of emitted fences,
> preventing timeouts when waiting for fences that no longer exist.
>
> Signed-off-by: Chenglei Xie <[email protected]>
> Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> index ad4d442e7345e..6b5fcdd987693 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> @@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
>                 gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
>                 soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
>                 mutex_unlock(&adev->srbm_mutex);
> +
> +               /* Update fence writeback memory to align with software state after reset.
> +                * After GPU reset, the hardware queue is cleared and all pending fences
> +                * are lost. The fence writeback memory may be stale from before reset. To prevent
> +                * waiting for lost fences, update writeback memory to match sync_seq.
> +                * This avoids waiting for lost fences and prevents timeouts.
> +                */

This doesn't make sense.  No other kiq_init_queue() function does
this.  When the GPU is reset, amdgpu_fence_driver_force_completion()
should get called for each ring.  That will set an error on the fence
and update the fence sequence.  Why is that not working?
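
To illustrate the expectation above, here is a minimal user-space sketch.
It is a toy model under stated assumptions, not the driver code: struct
model_ring and the model_*() helpers are hypothetical names, and the real
amdgpu_fence_driver_force_completion() may do more than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fence state of one ring. */
struct model_ring {
	uint32_t sync_seq;  /* last sequence number emitted by software */
	uint32_t writeback; /* last sequence number signalled via writeback */
	int error;          /* error recorded for fences dropped by a reset */
};

/* Software emits a fence: only the software counter advances. */
static uint32_t model_emit(struct model_ring *ring)
{
	assert(ring);
	return ++ring->sync_seq;
}

/* "Hardware" signals completion by writing a sequence number back. */
static void model_hw_signal(struct model_ring *ring, uint32_t seq)
{
	assert(ring);
	ring->writeback = seq;
}

/*
 * Bounded poll for a fence. The signed difference keeps the comparison
 * correct across 32-bit wraparound. Returns true if the fence signalled
 * within the given number of tries, false on (modelled) timeout.
 */
static bool model_poll(const struct model_ring *ring, uint32_t seq,
		       unsigned int tries)
{
	assert(ring);
	while (tries--) {
		if ((int32_t)(ring->writeback - seq) >= 0)
			return true;
	}
	return false;
}

/*
 * Assumed behaviour of force completion after a reset: record an error
 * for the lost fences and move the writeback value up to sync_seq, so
 * nothing emitted before the reset is waited on again.
 */
static void model_force_completion(struct model_ring *ring, int error)
{
	assert(ring);
	ring->error = error;
	ring->writeback = ring->sync_seq;
}
```

In this model, a fence emitted before the reset times out in model_poll()
while the writeback value is stale, and signals once
model_force_completion() has run. If force completion runs for the KIQ
ring on reset, the extra write in kiq_init_queue() should be redundant.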

Alex

> +                if (ring->fence_drv.cpu_addr)
> +                       *ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
>         } else {
>                 memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
>                 ((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
> --
> 2.34.1
>
