On Thu, May 29, 2025 at 4:08 PM Alex Deucher <[email protected]> wrote:
>
> This set improves per queue reset support for GC10+.
> When we reset the queue, the queue is lost so we need
> to re-emit the unprocessed state from subsequent submissions.
> To that end, in order to make sure we actually restore
> unprocessed state, we need to enable legacy enforce isolation
> so that we can safely re-emit the unprocessed state.  If
> we don't, multiple jobs can run in parallel and we may not
> end up resetting the correct one.  This is similar to how
> Windows handles queues.  This also gives us correct guilty
> tracking for GC.
>
> Tested on GC 10 and 11 chips with a game running and
> then running hang tests.  The game pauses when the
> hang happens, then continues after the queue reset.
>
> I tried this same approach on GC8 and 9, but it
> was not as reliable as soft recovery.  As such, I've dropped
> the KGQ reset code for pre-GC10.
>
> The same approach is extended to SDMA and VCN.
> They don't need enforce isolation because those engines
> are single threaded so they always operate serially.
>
> Rework re-emit to signal the seq number of the bad job and
> verify that the reset worked, then re-emit the
> rest of the non-guilty state.  This way we are not waiting on
> the rest of the state to complete, and if the subsequent state
> also contains a bad job, we'll end up in queue reset again rather
> than adapter reset.

git tree available here:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

Alex

>
> v4: Drop explicit padding patches
>     Drop new timeout macro
>     Rework re-emit sequence
> v5: Add a helper for reemit
>     Convert VCN, JPEG, SDMA to use new helpers
>
> Alex Deucher (27):
>   drm/amdgpu: enable legacy enforce isolation by default
>   drm/amdgpu/gfx7: drop reset_kgq
>   drm/amdgpu/gfx8: drop reset_kgq
>   drm/amdgpu/gfx9: drop reset_kgq
>   drm/amdgpu: move force completion into ring resets
>   drm/amdgpu: track ring state associated with a job
>   drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
>   drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
>   drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
>   drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
>   drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg5.0.0: add queue reset
>   drm/amdgpu/jpeg5: re-emit unprocessed state on ring reset
>   drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
>   drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
>
> Christian König (1):
>   drm/amdgpu: rework queue reset scheduler interaction
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |  6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 32 +++++-----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 46 ++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  8 +++
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 31 ++--------
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 21 +------
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     | 21 +------
>  drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c      | 71 ----------------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 71 ----------------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 51 +---------------
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    |  6 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c     |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c   |  3 +-
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c   | 12 ++++
>  drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c   |  3 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c   |  4 ++
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c     |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c     |  7 ++-
>  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c     |  6 +-
>  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c     |  6 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c    |  3 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c    |  2 +-
>  drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c    |  2 +-
>  30 files changed, 162 insertions(+), 289 deletions(-)
>
> --
> 2.49.0
>
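
For readers who want the re-emit sequence from the cover letter in concrete terms, here is a rough userspace model of it. It is only a sketch under stated assumptions: every name in it (toy_job, toy_ring, toy_ring_run, toy_ring_process, reset_works) is hypothetical and none of it is amdgpu code or the helper added in this series. It models a queue that executes jobs serially, as enforce isolation is meant to guarantee. When a job hangs, the queue is reset, only the bad job's seq number is signaled and checked, and the remaining jobs are re-emitted. A second bad job leads to another queue reset, and only a failed verification escalates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one submitted job; not an amdgpu structure. */
struct toy_job {
	uint32_t seq;	/* fence sequence number assigned at submission */
	bool hangs;	/* true if executing this job hangs the queue */
};

/* Hypothetical model of a queue that executes its jobs serially. */
struct toy_ring {
	const struct toy_job *jobs;	/* jobs in submission order */
	size_t num_jobs;
	uint32_t signaled_seq;		/* last seq signaled by the "hardware" */
	unsigned int resets;		/* number of per-queue resets performed */
	bool reset_works;		/* assumption: whether a queue reset succeeds */
	bool needs_adapter_reset;	/* set when a queue reset could not be verified */
};

/*
 * Execute jobs starting at index start. Returns the index of the first
 * job that hangs, or num_jobs if everything completed.
 */
static size_t toy_ring_run(struct toy_ring *ring, size_t start)
{
	for (size_t i = start; i < ring->num_jobs; i++) {
		if (ring->jobs[i].hangs)
			return i;
		ring->signaled_seq = ring->jobs[i].seq;
	}
	return ring->num_jobs;
}

/*
 * Process the whole queue, recovering from hangs with per-queue resets.
 * Returns the number of jobs that were marked guilty and skipped.
 */
static size_t toy_ring_process(struct toy_ring *ring)
{
	size_t next = 0;
	size_t guilty = 0;

	assert(ring->jobs != NULL || ring->num_jobs == 0);

	while (next < ring->num_jobs) {
		size_t hung = toy_ring_run(ring, next);

		if (hung == ring->num_jobs)
			break;	/* nothing left to recover from */

		/* Reset the queue; whatever was queued behind the bad job is lost. */
		ring->resets++;

		/* Signal only the bad job's seq number (if the reset took effect). */
		if (ring->reset_works)
			ring->signaled_seq = ring->jobs[hung].seq;

		/* Verify the reset by checking that exactly that seq was signaled. */
		if (ring->signaled_seq != ring->jobs[hung].seq) {
			ring->needs_adapter_reset = true;
			break;
		}

		/* Re-emit the rest; the next loop iteration runs it without waiting here. */
		guilty++;
		next = hung + 1;
	}
	return guilty;
}
```

With five jobs where the second and fourth hang and reset_works set, this model performs two queue resets, skips the two guilty jobs, and ends with the last job's seq signaled. With reset_works cleared it stops at the first hang and flags an adapter reset instead.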
