On Tue, 2026-06-23 at 08:19 +0100, Tvrtko Ursulin wrote:
> On 17/06/2026 09:38, Philipp Stanner wrote:
> > > + /*
> > > +  * Make sure this entity is not used by the scheduler at the moment.
> > > +  *
> > > +  * Scheduler is guaranteed to be stable after the entity was stopped and
> > > +  * removed from the run-queue.
> > > +  */
> > > + if (sched)
> > > +         drm_sched_flush_run_work(sched);
> > >   
> > > - /* The entity is guaranteed to not be used by the scheduler */
> > >           prev = rcu_dereference_check(entity->last_scheduled, true);
> > >           dma_fence_get(prev);
> > >           while ((job = drm_sched_entity_queue_pop(entity))) {
> > > @@ -576,6 +578,13 @@ void drm_sched_entity_select_rq(struct 
> > > drm_sched_entity *entity)
> > >                   return;
> > >   
> > >           spin_lock(&entity->lock);
> > > +
> > > + if (entity->stopped) {
> > > +         spin_unlock(&entity->lock);
> > > +         return;
> > > +
> > > + }
> > 
> > Seems unrelated? Why wasn't this needed semantically before?
> 
> It solidifies the guarantee drm_sched_entity_kill() expects that the 
> scheduler assigned to an entity cannot change after entity has been stopped.

That in general I like a lot, of course. As you probably know my
"dream" for drm_sched is to get rid of as much lockless magic as
possible, shown by the entries I've added to Documentation/gpu/todo.rst
(on drm-misc-next).

> 
> There we have this sequence:
> 
>    spin_lock(&entity->lock);
>    entity->stopped = true;
>    sched = drm_sched_rq_remove_entity(entity->rq, entity);
>    spin_unlock(&entity->lock);
> 
>    if (sched)
>       drm_sched_flush_run_work(sched);
> 
> That is, without that check, in theory, an evil driver could race 
> drm_sched_entity_select_rq() (via drm_sched_job_arm()) and 
> drm_sched_entity_kill(). I am not sure if any driver can actually do 
> that at the moment but it felt sensible to express it in code.

The devil is in the "at the moment".

AFAICS and as your explanation sounds, this is an existing problem that
is unrelated to the Flush RFC. So I suppose that this should be a
separate patch (independent from this one) which precisely focusses on
this robustness work.

A dedicated patch probably should then be concerned about the 'stopped'
bool's synchronization in general and should investigate the lockless
check in drm_sched_entity_is_idle(), too. The memory barrier there
seems only concerned about the list…


> 
> > > +
> > >           sched = drm_sched_pick_best(entity->sched_list, 
> > > entity->num_sched_list);
> > >           rq = sched ? &sched->rq : NULL;
> > >           if (rq != entity->rq) {
> > > diff --git a/drivers/gpu/drm/scheduler/sched_internal.h 
> > > b/drivers/gpu/drm/scheduler/sched_internal.h
> > > index 13ecb771d7a2..80dece3be415 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_internal.h
> > > +++ b/drivers/gpu/drm/scheduler/sched_internal.h
> > > @@ -35,12 +35,22 @@ bool drm_sched_can_queue(struct drm_gpu_scheduler 
> > > *sched,
> > >                            struct drm_sched_entity *entity);
> > >   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
> > >   
> > > +/**
> > > + * drm_sched_flush_run_work - flush the run-job work
> > 
> > In v1, you'd probably want to document what this function typically
> > will be used for :)
> 
> Well its in the scheduler _internal_ header and I am not sure what to
> write which will add real value.
> 
> "Only used to make sure a stopped entity is not in use by the scheduler 
> workers."

I think the magic word that provides value to newbies trying to get
familiar with our complex code base is "synchronization".

"A synchronization helper used to make sure that no scheduler worker
does access this entity anymore."


Kind greetings,
Philipp

Reply via email to