On Wed, 2025-11-12 at 13:13 +0000, Tvrtko Ursulin wrote:
> 
> On 12/11/2025 12:15, Philipp Stanner wrote:
> > On Wed, 2025-11-12 at 09:42 +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 12/11/2025 07:31, Philipp Stanner wrote:
> > > > drm_sched_entity_push_job() uses the unlocked spsc_queue. It takes a
> > > > reference to that queue's tip at the start, and some time later removes
> > > > that entry from that list, without locking or protection against
> > > > preemption.
> > > 
> > > I couldn't figure out what you refer to by tip reference at the start,
> > > and later removes it. Are you talking about the top level view from
> > > drm_sched_entity_push_job() or where exactly?
> > > 
> > > > This is by design, since the spsc_queue demands single producer and
> > > > single consumer. It was, however, never documented.
> > > > 
> > > > Document that you must not call drm_sched_entity_push_job() in parallel
> > > > for the same entity.
> > > > 
> > > > Signed-off-by: Philipp Stanner <[email protected]>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_entity.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index 5a4697f636f2..b31e8d14aa20 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -562,6 +562,9 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > > >   * drm_sched_entity_push_job - Submit a job to the entity's job queue
> > > >   * @sched_job: job to submit
> > > >   *
> > > > + * It is illegal to call this function in parallel, at least for jobs belonging
> > > > + * to the same entity. Doing so leads to undefined behavior.
> > > 
> > > One thing that is documented in the very next paragraph is that the
> > > design implies a lock held between arm and push. At least to ensure
> > > seqno order matches the queue order.
> > > 
> > > I did not get what other breakage you found, but I also previously did
> > > find something other than that. Hm.. if I could only remember what it
> > > was. Probably mine was something involving drm_sched_entity_select_rq(),
> > > drm_sched_entity_modify_sched() and (theoretical) multi-threaded
> > > userspace submit on the same entity. Luckily it seems no one does that.
> > > 
> > > The issue you found is separate and not theoretically fixed by this
> > > hypothetical common lock held over arm and push?
> > 
> > Well, if 2 CPUs should ever run in parallel in
> > drm_sched_entity_push_job() the spsc_queue will just explode. Most
> > notably, one CPU could get the job at the tip (the oldest job), then be
> > preempted, and then another CPU takes the same job and pops it.
> 
> Ah, you are talking about the dequeue/pop side. First paragraph of the
> commit message can be clarified in that case.
> 
> Pop is serialised by the worker so I don't think two simultaneous
> dequeues on the same scheduler are possible. How did you trigger it?
> 
> > The API contract should be that the user doesn't have to know whether
> > there's a linked list or the magic spsc_queue.
> > 
> > Luckily we moved the peek/pop helpers to sched_internal.h.
> 
> Btw I thought you gave up on the scheduler and are working on the simple
> rust queue for firmware schedulers so how come you are finding subtle
> bugs in this code?
I'm still a maintainer, for a variety of reasons. That we work on
something for FW with a clean locking design doesn't mean we don't care
about the existing infrastructure anymore.

And I firmly believe in documentation. People should know the rules
from the API documentation, not from looking into the implementation.
It's kind of crucial design info that you must only push into entities
sequentially, no?

This doesn't fix a bug, obviously, since it's just a line of docu.
Regardless, pushing into the spsc queue in parallel would explode.
Emphasizing that costs us nothing.
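To make that concrete: the driver-side pattern the existing kerneldoc
already implies (the common lock held over arm and push) looks roughly
like this. Just a sketch, of course – "my_context" and "submit_lock"
are made-up names, not code from any actual driver:

#include <linux/mutex.h>
#include <drm/gpu_scheduler.h>

struct my_context {
	struct drm_sched_entity entity;
	struct mutex submit_lock;	/* serialises arm + push */
};

static void my_driver_submit(struct my_context *ctx,
			     struct drm_sched_job *job)
{
	/*
	 * One lock over both calls: the fence seqno order then matches
	 * the queue order, and drm_sched_entity_push_job() can never
	 * run in parallel for this entity – single producer guaranteed.
	 */
	mutex_lock(&ctx->submit_lock);
	drm_sched_job_arm(job);
	drm_sched_entity_push_job(job);
	mutex_unlock(&ctx->submit_lock);
}

P.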
