On Wed, 2025-11-12 at 13:13 +0000, Tvrtko Ursulin wrote:
> 
> On 12/11/2025 12:15, Philipp Stanner wrote:
> > On Wed, 2025-11-12 at 09:42 +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 12/11/2025 07:31, Philipp Stanner wrote:
> > > > drm_sched_entity_push_job() uses the unlocked spsc_queue. It takes a
> > > > reference to that queue's tip at the start, and some time later removes
> > > > that entry from that list, without locking or protection against
> > > > preemption.
> > > 
> > > I couldn't figure out what you refer to by tip reference at the start,
> > > and later removes it. Are you talking about the top level view from
> > > drm_sched_entity_push_job() or where exactly?
> > > 
> > > > This is by design, since the spsc_queue demands single producer and
> > > > single consumer. It was, however, never documented.
> > > > 
> > > > Document that you must not call drm_sched_entity_push_job() in parallel
> > > > for the same entity.
> > > > 
> > > > Signed-off-by: Philipp Stanner <[email protected]>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_entity.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index 5a4697f636f2..b31e8d14aa20 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -562,6 +562,9 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > > >   * drm_sched_entity_push_job - Submit a job to the entity's job queue
> > > >   * @sched_job: job to submit
> > > >   *
> > > > + * It is illegal to call this function in parallel, at least for jobs belonging
> > > > + * to the same entity. Doing so leads to undefined behavior.
> > > 
> > > One thing that is documented in the very next paragraph is that the
> > > design implies a lock held between arm and push. At least to ensure
> > > seqno order matches the queue order.
> > > 
> > > I did not get what other breakage you found, but I also previously did
> > > find something other than that. Hm.. if I could only remember what it
> > > was. Probably mine was something involving drm_sched_entity_select_rq(),
> > > drm_sched_entity_modify_sched() and (theoretical) multi-threaded
> > > userspace submit on the same entity. Luckily it seems no one does that.
> > > 
> > > The issue you found is separate and not theoretically fixed by this
> > > hypothetical common lock held over arm and push?
> > 
> > Well, if 2 CPUs should ever run in parallel in
> > drm_sched_entity_push_job() the spsc_queue will just explode. Most
> > notably, one CPU could get the job at the tip (the oldest job), then be
> > preempted, and then another CPU takes the same job and pops it.
> 
> Ah, you are talking about the dequeue/pop side. First paragraph of the
> commit message can be clarified in that case.
> 
> Pop is serialised by the worker so I don't think two simultaneous
> dequeues on the same scheduler are possible. How did you trigger it?
> 
> > The API contract should be that the user doesn't have to know whether
> > there's a linked list or the magic spsc_queue.
> > 
> > Luckily we moved the peek/pop helpers to sched_internal.h.
> 
> Btw I thought you gave up on the scheduler and are working on the simple
> rust queue for firmware schedulers so how come you are finding subtle
> bugs in this code?
I'm still a maintainer, for a variety of reasons. That we work on
something for FW with a clean locking design doesn't mean we don't care
about the existing infrastructure anymore.

And I firmly believe in documentation. People should know the rules
from the API documentation, not from looking into the implementation.
It's kind of crucial design info that you must only push into entities
sequentially, no?

This doesn't fix a bug, obviously, since it's just a line of docu.
Regardless, pushing into the spsc queue in parallel would explode.
Emphasizing that costs us nothing.
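To make that concrete: the driver-side pattern the existing kerneldoc
already implies (the common lock held over arm and push) looks roughly
like this. Just a sketch, of course – "my_context" and "submit_lock"
are made-up names, not code from any actual driver:

#include <linux/mutex.h>
#include <drm/gpu_scheduler.h>

struct my_context {
	struct drm_sched_entity entity;
	struct mutex submit_lock;	/* serialises arm + push */
};

static void my_driver_submit(struct my_context *ctx,
			     struct drm_sched_job *job)
{
	/*
	 * One lock over both calls: the fence seqno order then matches
	 * the queue order, and drm_sched_entity_push_job() can never
	 * run in parallel for this entity – single producer guaranteed.
	 */
	mutex_lock(&ctx->submit_lock);
	drm_sched_job_arm(job);
	drm_sched_entity_push_job(job);
	mutex_unlock(&ctx->submit_lock);
}

P.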
