On Fri, Mar 6, 2026 at 3:26 PM Xie, Chenglei <[email protected]> wrote:
>
> Hi Alex,
>
> amdgpu_fence_driver_force_completion() is working but it was called prior to 
> GPU reset.
>
> However, in this failing case, during a GPU reset, the KIQ ring is still used 
> (e.g. for HDP flush in amdgpu_kiq_hdp_flush()). Each of those submissions 
> emits a fence and increments sync_seq. The hardware queue is later cleared by 
> the reset and never runs those commands, so it never updates the fence 
> writeback location. After reset, the writeback memory still holds the last 
> value written before reset while sync_seq has moved far ahead.

Sounds like a bug in the reset sequence.  We should fix that so that
we don't use KIQ during the reset.

Alex

>
> Before emitting a new fence, amdgpu_fence_emit_polling() waits until the 
> oldest in-flight fence is done: it calls amdgpu_fence_wait_polling(ring, seq 
> - ring->fence_drv.num_fences_mask, timeout), which busy waits until the 
> writeback value is at least that sequence. Because the writeback value is 
> still near the pre-reset value and sync_seq has grown, the required sequence 
> (sync_seq - num_fences_mask) is much larger than the writeback value. The 
> driver therefore waits for completion of fences that were lost in the reset 
> and will never complete, the wait hits the timeout, and 
> amdgpu_fence_emit_polling() returns -ETIMEDOUT. The driver then refuses to 
> emit new KIQ fences, assuming the ring is full, and KIQ submissions 
> effectively stall.
>
> To fix this, when re-initializing the KIQ after a reset, the code now sets 
> the fence writeback memory to sync_seq, so the driver no longer waits for 
> those lost fences. amdgpu_fence_emit_polling() can then emit new fences without timing 
> out, and KIQ usage resumes after reset.
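
The stall and the proposed resync described above can be sketched in a small userspace model. This is only an illustration built from the description in this thread, not the driver code: struct model_ring and the model_* helpers are hypothetical names, the timeout is a plain counter instead of a udelay() loop, and the sequence numbers in the usage note are made up.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fence state discussed in the thread. */
struct model_ring {
	uint32_t writeback;       /* last sequence the hardware wrote back */
	uint32_t sync_seq;        /* last sequence software emitted */
	uint32_t num_fences_mask; /* number of fence slots minus one */
};

/*
 * Poll until the writeback value reaches wait_seq or the budget runs out.
 * The signed difference keeps the comparison valid across 32-bit wraparound.
 * Returns the remaining budget, or 0 on timeout.
 */
static long model_wait_polling(const struct model_ring *r,
			       uint32_t wait_seq, long timeout)
{
	while ((int32_t)(wait_seq - r->writeback) > 0 && timeout > 0)
		timeout -= 2; /* stands in for a short delay per iteration */

	return timeout > 0 ? timeout : 0;
}

/*
 * Emit a new fence: bump sync_seq, then wait for the oldest slot
 * (seq - num_fences_mask) to complete before reusing it.
 */
static int model_emit_polling(struct model_ring *r, long timeout)
{
	uint32_t seq = ++r->sync_seq;

	if (model_wait_polling(r, seq - r->num_fences_mask, timeout) < 1)
		return -ETIMEDOUT;

	return 0;
}

/* The resync proposed in the patch: mark everything emitted as completed. */
static void model_resync_writeback(struct model_ring *r)
{
	r->writeback = r->sync_seq;
}
```

With writeback = 100, sync_seq = 120 and num_fences_mask = 7 (twenty submissions lost in the reset), the next model_emit_polling() waits for sequence 114, which never arrives, and returns -ETIMEDOUT. After model_resync_writeback() the same call succeeds. The model says nothing about whether the resync belongs in kiq_init_queue() or whether KIQ use during reset should be avoided instead.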
>
> Regards,
> Chenglei
> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Friday, March 6, 2026 11:18 AM
> To: Xie, Chenglei <[email protected]>
> Cc: [email protected]; Chan, Hing Pong <[email protected]>; 
> Luo, Zhigang <[email protected]>; Zhang, Hawking <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Fix KIQ fence timeout after GPU reset on GFX v9.4.3
>
> On Tue, Mar 3, 2026 at 11:29 AM Chenglei Xie <[email protected]> wrote:
> >
> > After GPU reset, the hardware queue is cleared and all pending fences
> > are lost. However, the fence writeback memory remains stale from
> > before reset, while software continues emitting fences and sync_seq
> > keeps incrementing. This causes amdgpu_fence_emit_polling() to wait
> > for fences that were lost during reset, resulting in -ETIMEDOUT errors.
> >
> > Fix this by updating the fence writeback memory to match sync_seq
> > after GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the
> > hardware's view of completed fences with software's view of emitted
> > fences, preventing timeouts when waiting for fences that no longer exist.
> >
> > Signed-off-by: Chenglei Xie <[email protected]>
> > Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > index ad4d442e7345e..6b5fcdd987693 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > @@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
> >                 gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
> >                 soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> >                 mutex_unlock(&adev->srbm_mutex);
> > +
> > +               /* Update fence writeback memory to align with software state after reset.
> > +                * After GPU reset, the hardware queue is cleared and all pending fences
> > +                * are lost. The fence writeback memory may be stale from before reset. To prevent
> > +                * waiting for lost fences, update writeback memory to match sync_seq.
> > +                * This avoids waiting for lost fences and prevents timeouts.
> > +                */
>
> This doesn't make sense.  No other kiq_init_queue() function does this.  When 
> the GPU is reset, amdgpu_fence_driver_force_completion()
> should get called for each ring.  That will set an error on the fence and 
> update the fence sequence.  Why is that not working?
>
> Alex
>
> > +                if (ring->fence_drv.cpu_addr)
> > +                       *ring->fence_drv.cpu_addr = cpu_to_le32(ring->fence_drv.sync_seq);
> >         } else {
> >                 memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
> >                 ((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
> > --
> > 2.34.1
> >
