On 05/22, Alex Deucher wrote: > From: Christian König <[email protected]> > > Stopping the scheduler for queue reset is generally a good idea because > it prevents any worker from touching the ring buffer. > > But using amdgpu_fence_driver_force_completion() before restarting it was > a really bad idea because it marked fences as failed while the work was > potentially still running. > > Stop doing that and cleanup the comment a bit. > > v2: keep amdgpu_fence_driver_force_completion() for non-gfx rings
Why keep this amdgpu_fence_driver_force_completion() for non-gfx is ok? >From the commit descriptions, sounds like we want to avoid amdgpu_fence_driver_force_completion() before the driver restarts the queue. > > Reviewed-by: Alex Deucher <[email protected]> > Signed-off-by: Christian König <[email protected]> > Signed-off-by: Alex Deucher <[email protected]> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 26 +++++++++++++------------ > 1 file changed, 14 insertions(+), 12 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index acb21fc8b3ce5..e57401ef85140 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat > amdgpu_job_timedout(struct drm_sched_job *s_job) > } else if (amdgpu_gpu_recovery && ring->funcs->reset) { > bool is_guilty; > > - dev_err(adev->dev, "Starting %s ring reset\n", > s_job->sched->name); > - /* stop the scheduler, but don't mess with the > - * bad job yet because if ring reset fails > - * we'll fall back to full GPU reset. > + dev_err(adev->dev, "Starting %s ring reset\n", > + s_job->sched->name); > + > + /* > + * Stop the scheduler to prevent anybody else from touching the > + * ring buffer. > */ > drm_sched_wqueue_stop(&ring->sched); > > @@ -157,19 +159,19 @@ static enum drm_gpu_sched_stat > amdgpu_job_timedout(struct drm_sched_job *s_job) > > r = amdgpu_ring_reset(ring, job->vmid); > if (!r) { > - if (amdgpu_ring_sched_ready(ring)) > - drm_sched_stop(&ring->sched, s_job); > if (is_guilty) { > atomic_inc(&ring->adev->gpu_reset_counter); > - amdgpu_fence_driver_force_completion(ring); > + if (ring->funcs->type != AMDGPU_RING_TYPE_GFX) > + > amdgpu_fence_driver_force_completion(ring); > } > - if (amdgpu_ring_sched_ready(ring)) > - drm_sched_start(&ring->sched, 0); > - dev_err(adev->dev, "Ring %s reset succeeded\n", > ring->sched.name); > - drm_dev_wedged_event(adev_to_drm(adev), > DRM_WEDGE_RECOVERY_NONE); > + drm_sched_wqueue_start(&ring->sched); > + dev_err(adev->dev, "Ring %s reset succeeded\n", > + ring->sched.name); > + drm_dev_wedged_event(adev_to_drm(adev), > + DRM_WEDGE_RECOVERY_NONE); > goto exit; > } > - dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name); > + dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name); > } > dma_fence_set_error(&s_job->s_fence->finished, -ETIME); > > -- > 2.49.0 > -- Rodrigo Siqueira
