On Fri, Mar 6, 2026 at 3:26 PM Xie, Chenglei <[email protected]> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alex,
>
> amdgpu_fence_driver_force_completion() is working but it was called prior to
> GPU reset.
>
> However, in this failing case, during a GPU reset, the KIQ ring is still used
> (e.g. for HDP flush in amdgpu_kiq_hdp_flush()). Each of those submissions
> emits a fence and increments sync_seq. The hardware queue is later cleared by
> the reset and never runs those commands, so it never updates the fence
> writeback location. After reset, the writeback memory still holds the last
> value written before reset while sync_seq has moved far ahead.

Sounds like a bug in the reset sequence. We should fix that so that we don't
use KIQ during the reset.

Alex

>
> Before emitting a new fence, amdgpu_fence_emit_polling() waits until the
> oldest in-flight fence is done: it calls
> amdgpu_fence_wait_polling(ring, seq - ring->fence_drv.num_fences_mask, timeout),
> which busy waits until the writeback value is at least that sequence.
> Because the writeback value is still near the pre-reset value and sync_seq
> has grown, the required sequence (sync_seq - num_fences_mask) is much larger
> than the writeback value. The driver therefore waits for completion of
> fences that were lost in the reset and will never complete, the wait hits
> the timeout, and amdgpu_fence_emit_polling() returns -ETIMEDOUT. The driver
> then refuses to emit new KIQ fences, assuming the ring is full, and KIQ
> submissions effectively stall.
>
> To fix this, when re-initializing the KIQ after a reset, the code now sets
> the fence writeback memory to sync_seq. So it no longer waits for those lost
> fences. amdgpu_fence_emit_polling() can then emit new fences without timing
> out, and KIQ usage resumes after reset.
>
> Regards,
> Chenglei
> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Friday, March 6, 2026 11:18 AM
> To: Xie, Chenglei <[email protected]>
> Cc: [email protected]; Chan, Hing Pong <[email protected]>;
> Luo, Zhigang <[email protected]>; Zhang, Hawking <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Fix KIQ fence timeout after GPU reset on GFX
> v9.4.3
>
> [You don't often get email from [email protected]. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> On Tue, Mar 3, 2026 at 11:29 AM Chenglei Xie <[email protected]> wrote:
> >
> > After GPU reset, the hardware queue is cleared and all pending fences
> > are lost.
> > However, the fence writeback memory remains stale from before reset,
> > while software continues emitting fences and sync_seq keeps
> > incrementing. This causes amdgpu_fence_emit_polling() to wait for
> > fences that were lost during reset, resulting in -ETIMEDOUT errors.
> >
> > Fix this by updating the fence writeback memory to match sync_seq
> > after GPU reset in gfx_v9_4_3_xcc_kiq_init_queue(). This aligns the
> > hardware's view of completed fences with software's view of emitted
> > fences, preventing timeouts when waiting for fences that no longer exist.
> >
> > Signed-off-by: Chenglei Xie <[email protected]>
> > Change-Id: I717df52ed0ef0bb51a6901f218191d9837a77f6f
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > index ad4d442e7345e..6b5fcdd987693 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
> > @@ -2135,6 +2135,15 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
> >  		gfx_v9_4_3_xcc_kiq_init_register(ring, xcc_id);
> >  		soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, xcc_id));
> >  		mutex_unlock(&adev->srbm_mutex);
> > +
> > +		/* Update fence writeback memory to align with software state after reset.
> > +		 * After GPU reset, the hardware queue is cleared and all pending fences
> > +		 * are lost. The fence writeback memory may be stale from before reset. To prevent
> > +		 * waiting for lost fences, update writeback memory to match sync_seq.
> > +		 * This avoids waiting for lost fences and prevents timeouts.
> > +		 */
>
> This doesn't make sense. No other kiq_init_queue() function does this. When
> the GPU is reset, amdgpu_fence_driver_force_completion()
> should get called for each ring. That will set an error on the fence and
> update the fence sequence. Why is that not working?
>
> Alex
>
> > +		if (ring->fence_drv.cpu_addr)
> > +			*ring->fence_drv.cpu_addr =
> > +				cpu_to_le32(ring->fence_drv.sync_seq);
> >  	} else {
> >  		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
> >  		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
> > --
> > 2.34.1
> >
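
For readers following the thread, the stall described above can be reproduced in a small userspace model. This is only a sketch under stated assumptions, not the amdgpu code: struct model_fence_drv and every model_* function are hypothetical names invented for illustration, the timeout is a plain loop budget instead of a microsecond delay, and nothing here touches hardware. It shows that once sync_seq runs ahead of a stale writeback value by more than num_fences_mask, the polling wait can never succeed, and that copying sync_seq into the writeback slot (the approach in the patch under review) lets emission proceed again.

```c
#include <stdint.h>

/* Hypothetical model of the fence bookkeeping discussed in the thread. */
struct model_fence_drv {
	uint32_t writeback;       /* stands in for *ring->fence_drv.cpu_addr */
	uint32_t sync_seq;        /* last sequence number handed out by software */
	uint32_t num_fences_mask; /* how far sync_seq may run ahead of writeback */
};

/*
 * Poll until the writeback value has reached wait_seq or the budget is
 * used up. The signed difference keeps the comparison wrap-safe.
 * Returns the remaining budget; 0 means the wait timed out.
 */
long model_wait_polling(const struct model_fence_drv *drv,
			uint32_t wait_seq, long timeout)
{
	while ((int32_t)(wait_seq - drv->writeback) > 0 && timeout > 0)
		timeout--;
	return timeout;
}

/*
 * Take a new sequence number, then wait for the oldest slot to free up.
 * Returns 0 on success and -1 as a stand-in for -ETIMEDOUT.
 * Note: the model never advances writeback on success; real hardware would.
 */
int model_emit_polling(struct model_fence_drv *drv, long timeout)
{
	uint32_t seq = ++drv->sync_seq;

	if (model_wait_polling(drv, seq - drv->num_fences_mask, timeout) < 1)
		return -1;
	return 0;
}

/* Model submissions made during reset: sync_seq grows, writeback does not. */
void model_lose_submissions(struct model_fence_drv *drv, uint32_t count)
{
	drv->sync_seq += count;
}

/* Model the proposed fix: mark everything emitted so far as completed. */
void model_resync_writeback(struct model_fence_drv *drv)
{
	drv->writeback = drv->sync_seq;
}
```

With writeback and sync_seq both at 100 and num_fences_mask at 7, model_emit_polling() succeeds. After model_lose_submissions(&drv, 20) it fails, because the required sequence is 15 ahead of the stale writeback value. After model_resync_writeback(&drv) it succeeds again. The model says nothing about whether the resync belongs in kiq_init_queue() or whether KIQ use during reset should be avoided instead, which is the open question in the review.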
