+Cc Matthew, Danilo, who are also drm/sched maintainers

The get_maintainer script provides you with a list of maintainers.

On Mon, 2025-12-01 at 09:38 +0100, Christian König wrote:
> On 11/28/25 19:22, [email protected] wrote:
> > From: Vitaly Prosyak <[email protected]>
> > 
> > Currently drm_sched runs run_job and free_job on the per-scheduler
> > ordered submit workqueue, while timeouts (drm_sched_job_timedout())
> > run on sched->timeout_wq (e.g. amdgpu reset_domain->wq).
> > 
> > For drivers like amdgpu the reset path entered from timeout_wq may
> > still access the guilty job while free_job, running on submit_wq,
> > frees it. This allows a use-after-free when recovery continues to
> > touch job fields after drm_sched_free_job_work() has called
> > ops->free_job().
> > 
> > Queue work_free_job on sched->timeout_wq instead of submit_wq, both
> > from __drm_sched_run_free_queue() and drm_sched_wqueue_start(), so          
> > timeout/reset and free_job are always serialized on the same
> > workqueue.
> > 
> > Behavior changes:
> > 
> > - work_run_job stays on sched->submit_wq (ordered).
> > - work_free_job moves to sched->timeout_wq (timeout/reset queue).
> > - Submission and freeing may now run in parallel on different
> >   workqueues, but all shared state is already protected by
> >   job_list_lock and atomics.
> > 
> > Pros:
> > - Eliminates reset vs free_job race (no freeing while reset still
> >   uses the job).

It eliminates such a race *in amdgpu*. Other drivers might not have
that problem. I think Intel/Xe is refcounting jobs? Matthew?

> > - Matches the logical model: timeout selects guilty job and recovery,
> >   including freeing, is handled on one queue.
> > 
> > Cons / considerations:
> > - For users that don’t provide timeout_wq, free_job moves from the
> >   per-sched ordered queue to system_wq, which slightly changes
> >   scheduling behaviour but keeps correctness.
> 
> We should probably avoid that and use a single ordered wq for submit, 
> timeout, free when the driver doesn't provide one.

AFAICS this fix doesn't reliably fix anything, because you just don't
know what kind of workqueue the other drivers have passed as
timeout_wq.

> 
> We should potentially also add a warning/error when the driver supplied wq 
> isn't ordered.
> 
> Apart from that the change looks sane to me and avoids all the hacky
> workarounds around job lifetime.

I'm not convinced that this is the right path forward.

The fundamental issue here is the gross design problem within drm/sched
that drivers *create* jobs, but the scheduler *frees* them at an
unpredictable point in time.

This entire fix idea seems to circle around the concept of relying yet
again on the scheduler's internal behavior (i.e., when it schedules the
call to free_job()).

I think we discussed that at XDC, but the way I see it, if drivers have
strange job lifetime requirements where a job shall outlive drm_sched's
free_job() call, they must solve that with a proper synchronization
mechanism.

The first question would be: what does amdgpu still need the job for
after free_job() has run? What do you even need the job for after a
timeout?

And if you really still need its contents, can't you memcpy() the job
or something?
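
Purely as illustration (all "my_*" names are made up, this is not
amdgpu code): snapshot whatever recovery still needs while the job is
guaranteed to be alive, i.e. in the timedout_job() callback, and never
touch the job itself afterwards:

	struct my_job {
		struct drm_sched_job base;
		u64 seqno;
		u32 ring_idx;
		u64 vm_pd_addr;
	};

	struct my_job_snapshot {
		u64 seqno;
		u32 ring_idx;
		u64 vm_pd_addr;
	};

	/* Called from the driver's timedout_job() callback */
	static void my_prepare_reset(struct drm_sched_job *sched_job)
	{
		struct my_job *job = container_of(sched_job, struct my_job, base);
		struct my_job_snapshot snap = {
			.seqno      = job->seqno,
			.ring_idx   = job->ring_idx,
			.vm_pd_addr = job->vm_pd_addr,
		};

		/* From here on, recovery only uses the snapshot, never *job */
		my_do_gpu_reset(&snap);
	}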

Assuming that amdgpu really does need it and that this cannot easily be
solved, I suppose the obvious answer for differing memory lifetimes is
refcounting. So amdgpu could just let drm_sched drop its reference in
free_job(), and from then onward it's amdgpu's problem.
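
To make that concrete, a minimal kref-based sketch (again, "my_*" is
made up, nothing amdgpu-specific; the reset side has to take its
reference while the job is still known to be alive, e.g. in
timedout_job()):

	struct my_job {
		struct drm_sched_job base;
		struct kref refcount;	/* kref_init() at job creation */
		/* driver state the reset path still wants to look at */
	};

	static void my_job_release(struct kref *ref)
	{
		struct my_job *job = container_of(ref, struct my_job, refcount);

		drm_sched_job_cleanup(&job->base);
		kfree(job);
	}

	/* drm_sched's ->free_job() callback: only drops the scheduler's reference */
	static void my_free_job(struct drm_sched_job *sched_job)
	{
		struct my_job *job = container_of(sched_job, struct my_job, base);

		kref_put(&job->refcount, my_job_release);
	}

	/* Reset path: pin the job for as long as recovery dereferences it */
	static void my_reset_handler(struct my_job *job)
	{
		kref_get(&job->refcount);	/* taken while the job is still alive */
		/* ... recovery work that touches *job ... */
		kref_put(&job->refcount, my_job_release);
	}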

I hope Matthew can educate us on how Xe does it.

AFAIK Nouveau doesn't have that problem, because on timeout we just
terminate the channel.

Would also be interesting to hear whether other driver folks have the
problem of free_job() being racy.

+Cc Boris, Lucas.


P.


> 
> But removing those workarounds should probably be a second step.
> 
> Regards,
> Christian.
> 
> > 
> > Cc: Philipp Stanner <[email protected]>
> > Cc: Alex Deucher <[email protected]>
> > Cc: Christian König <[email protected]>
> > Suggested-by: Mathew from Intel during XDC
> > Suggested-by: Christian König <[email protected]>
> > Signed-off-by: Vitaly Prosyak <[email protected]>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 81ad40d9582b..1243200d475e 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -355,7 +355,7 @@ static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >  static void __drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
> >  {
> >     if (!READ_ONCE(sched->pause_submit))
> > -           queue_work(sched->submit_wq, &sched->work_free_job);
> > +           queue_work(sched->timeout_wq, &sched->work_free_job);
> >  }
> >  
> >  /**
> > @@ -1493,6 +1493,6 @@ void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
> >  {
> >     WRITE_ONCE(sched->pause_submit, false);
> >     queue_work(sched->submit_wq, &sched->work_run_job);
> > -   queue_work(sched->submit_wq, &sched->work_free_job);
> > +   queue_work(sched->timeout_wq, &sched->work_free_job);
> >  }
> >  EXPORT_SYMBOL(drm_sched_wqueue_start);
> 
