On Thu, May 29, 2025 at 4:54 PM Alex Deucher <[email protected]> wrote:
>
> On Thu, May 29, 2025 at 4:08 PM Alex Deucher <[email protected]> wrote:
> >
> > This set improves per queue reset support for GC10+.
> > When we reset the queue, the queue is lost so we need
> > to re-emit the unprocessed state from subsequent submissions.
> > To that end, in order to make sure we actually restore
> > unprocessed state, we need to enable legacy enforce isolation
> > so that we can safely re-emit the unprocessed state. If
> > we don't, multiple jobs can run in parallel and we may not
> > end up resetting the correct one. This is similar to how
> > Windows handles queues. This also gives us correct guilty
> > tracking for GC.
> >
> > Tested on GC 10 and 11 chips with a game running and
> > then running hang tests. The game pauses when the
> > hang happens, then continues after the queue reset.
> >
> > I tried this same approach on GC8 and 9, but it
> > was not as reliable as soft recovery. As such, I've dropped
> > the KGQ reset code for pre-GC10.
> >
> > The same approach is extended to SDMA and VCN.
> > They don't need enforce isolation because those engines
> > are single threaded so they always operate serially.
> >
> > Rework re-emit to signal the seq number of the bad job and
> > verify that the reset worked, then re-emit the
> > rest of the non-guilty state. This way we are not waiting on
> > the rest of the state to complete, and if the subsequent state
> > also contains a bad job, we'll end up in queue reset again rather
> > than adapter reset.
>
> git tree available here:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

I've pushed several fixes since I last sent this and will continue to
push updates.

Alex

>
> Alex
>
> >
> > v4: Drop explicit padding patches
> >     Drop new timeout macro
> >     Rework re-emit sequence
> > v5: Add a helper for reemit
> >     Convert VCN, JPEG, SDMA to use new helpers
> >
> > Alex Deucher (27):
> >   drm/amdgpu: enable legacy enforce isolation by default
> >   drm/amdgpu/gfx7: drop reset_kgq
> >   drm/amdgpu/gfx8: drop reset_kgq
> >   drm/amdgpu/gfx9: drop reset_kgq
> >   drm/amdgpu: move force completion into ring resets
> >   drm/amdgpu: track ring state associated with a job
> >   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
> >   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
> >   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
> >   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg5.0.0: add queue reset
> >   drm/amdgpu/jpeg5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
> >   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
> >
> > Christian König (1):
> >   drm/amdgpu: rework queue reset scheduler interaction
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 ++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |  6 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 32 +++++-----
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  2 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 46 ++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  8 +++
> >  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 31 ++--------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 21 +------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     | 21 +------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c      | 71 ----------------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 71 ----------------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 51 +---------------
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c     |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c   |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c   | 12 ++++
> >  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c   |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c   |  4 ++
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c     |  7 ++-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c     |  7 ++-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c     |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c     |  6 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c      |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c    |  3 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c    |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c    |  2 +-
> >  30 files changed, 162 insertions(+), 289 deletions(-)
> >
> > --
> > 2.49.0
> >
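
For readers following along, the reworked re-emit sequence from the cover letter (signal the seq number of the bad job, verify that, then re-emit the rest of the non-guilty state) can be sketched as a small userspace model. This is only an illustration under assumptions: every name here (toy_job, toy_ring, toy_ring_emit, toy_ring_run, toy_ring_reset_and_reemit, the reset_works knob) is hypothetical and is not the amdgpu code or API from the series. It also assumes jobs run strictly in order, which is what enforce isolation or a single-threaded engine is meant to provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_JOBS 8

/* Hypothetical model of a submitted job; not an amdgpu structure. */
struct toy_job {
	uint32_t seq;   /* fence sequence number assigned at submission */
	bool hangs;     /* true if executing this job hangs the queue */
};

/* Hypothetical model of a ring that executes jobs strictly in order. */
struct toy_ring {
	struct toy_job jobs[MAX_JOBS];
	size_t count;          /* jobs currently emitted on the ring */
	size_t next;           /* index of the next job to execute */
	uint32_t signaled_seq; /* last fence seq written back */
	bool reset_works;      /* test knob: does a queue reset un-wedge the ring? */
	bool wedged;           /* set once a job has hung */
};

static void toy_ring_emit(struct toy_ring *ring, uint32_t seq, bool hangs)
{
	assert(ring->count < MAX_JOBS);
	ring->jobs[ring->count].seq = seq;
	ring->jobs[ring->count].hangs = hangs;
	ring->count++;
}

/* Execute jobs in order. Returns the index of the hung job, or -1 if idle. */
static int toy_ring_run(struct toy_ring *ring)
{
	while (!ring->wedged && ring->next < ring->count) {
		const struct toy_job *job = &ring->jobs[ring->next];

		if (job->hangs) {
			ring->wedged = true;
			break;
		}
		ring->signaled_seq = job->seq;
		ring->next++;
	}
	return ring->wedged ? (int)ring->next : -1;
}

/*
 * Reset the queue after the job at index 'guilty' hung. Returns false if
 * the reset could not be verified, in which case a caller would have to
 * escalate (for example to a full adapter reset).
 */
static bool toy_ring_reset_and_reemit(struct toy_ring *ring, size_t guilty)
{
	struct toy_job pending[MAX_JOBS];
	size_t npending = 0;
	uint32_t guilty_seq;
	size_t i;

	assert(guilty < ring->count);
	guilty_seq = ring->jobs[guilty].seq;

	/* Save the unprocessed state submitted after the guilty job. */
	for (i = guilty + 1; i < ring->count; i++)
		pending[npending++] = ring->jobs[i];

	/* The reset loses everything that was on the queue. */
	ring->count = 0;
	ring->next = 0;
	ring->wedged = !ring->reset_works;

	/* Signal only the guilty job's seq and verify it lands. */
	toy_ring_emit(ring, guilty_seq, false);
	if (toy_ring_run(ring) != -1 || ring->signaled_seq != guilty_seq)
		return false;

	/* Re-emit the rest without waiting for it to complete. */
	for (i = 0; i < npending; i++)
		toy_ring_emit(ring, pending[i].seq, pending[i].hangs);
	return true;
}
```

Because the remaining jobs are only re-emitted and not waited on, a second bad job further down the ring shows up as another hang from toy_ring_run() and goes through the same queue reset path again, which mirrors the behaviour the cover letter describes.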
