On 05/02, Christian König wrote: > Stopping the scheduler for queue reset is generally a good idea because > it prevents any worker from touching the ring buffer. > > But using amdgpu_fence_driver_force_completion() before restarting it was > a really bad idea because it marked fences as failed while the work was > potentially still running. > > Stop doing that and cleanup the comment a bit. > > Signed-off-by: Christian König <[email protected]> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++------------- > 1 file changed, 13 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index acb21fc8b3ce..a0fab947143b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat > amdgpu_job_timedout(struct drm_sched_job *s_job) > } else if (amdgpu_gpu_recovery && ring->funcs->reset) { > bool is_guilty; > > - dev_err(adev->dev, "Starting %s ring reset\n", > s_job->sched->name); > - /* stop the scheduler, but don't mess with the > - * bad job yet because if ring reset fails > - * we'll fall back to full GPU reset. > + dev_err(adev->dev, "Starting %s ring reset\n", > + s_job->sched->name); > + > + /* > + * Stop the scheduler to prevent anybody else from touching the > + * ring buffer. > */ > drm_sched_wqueue_stop(&ring->sched); > > @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat > amdgpu_job_timedout(struct drm_sched_job *s_job) > > r = amdgpu_ring_reset(ring, job->vmid); > if (!r) { > - if (amdgpu_ring_sched_ready(ring)) > - drm_sched_stop(&ring->sched, s_job); > - if (is_guilty) { > + if (is_guilty) > atomic_inc(&ring->adev->gpu_reset_counter); > - amdgpu_fence_driver_force_completion(ring); > - } > - if (amdgpu_ring_sched_ready(ring)) > - drm_sched_start(&ring->sched, 0); > - dev_err(adev->dev, "Ring %s reset succeeded\n", > ring->sched.name); > - drm_dev_wedged_event(adev_to_drm(adev), > DRM_WEDGE_RECOVERY_NONE); > + drm_sched_wqueue_start(&ring->sched); > + dev_err(adev->dev, "Ring %s reset succeeded\n", > + ring->sched.name); > + drm_dev_wedged_event(adev_to_drm(adev), > + DRM_WEDGE_RECOVERY_NONE); > goto exit; > } > - dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name); > + dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name); > } > dma_fence_set_error(&s_job->s_fence->finished, -ETIME); > > -- > 2.34.1 >
Hi Christian, I tested your series with Polaris and the 6600 series with the HangTest suite, and with your series, the following tests failed: - soft_recovery_loop - soft_recovery_pagefault_read - soft_recovery_pagefault_write When using the latest code of the amd-staging-drm-next, all of the above tests behave well. I'll try to identify what is going on. Thanks -- Rodrigo Siqueira
