On 23/06/2026 09:48, Philipp Stanner wrote:
On Tue, 2026-06-23 at 08:19 +0100, Tvrtko Ursulin wrote:
On 17/06/2026 09:38, Philipp Stanner wrote:
+       /*
+        * Make sure this entity is not used by the scheduler at the moment.
+        *
+        * Scheduler is guaranteed to be stable after the entity was stopped and
+        * removed from the run-queue.
+        */
+       if (sched)
+               drm_sched_flush_run_work(sched);
- /* The entity is guaranteed to not be used by the scheduler */
        prev = rcu_dereference_check(entity->last_scheduled, true);
        dma_fence_get(prev);
        while ((job = drm_sched_entity_queue_pop(entity))) {
@@ -576,6 +578,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity 
*entity)
                return;
   spin_lock(&entity->lock);
+
+       if (entity->stopped) {
+               spin_unlock(&entity->lock);
+               return;
+
+       }

Seems unrelated? Why wasn't this needed semantically before?

It solidifies the guarantee drm_sched_entity_kill() expects that the
scheduler assigned to an entity cannot change after entity has been stopped.

That in general I like a lot, of course. As you probably know my
"dream" for drm_sched is to get rid of as much lockless magic as
possible, shown by the entries I've added to Documentation/gpu/todo.rst
(on drm-misc-next).


There we have this sequence:

    spin_lock(&entity->lock);
    entity->stopped = true;
    sched = drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->lock);

    if (sched)
        drm_sched_flush_run_work(sched);

That is, without that check, in theory, an evil driver could race
drm_sched_entity_select_rq() (via drm_sched_job_arm()) and
drm_sched_entity_kill(). I am not sure if any driver can actually do
that at the moment but it felt sensible to express it in code.

The devil is in the "at the moment".

AFAICS and as your explanation sounds, this is an existing problem that
is unrelated to the Flush RFC. So I suppose that this should be a
separate patch (independent from this one) which precisely focusses on
this robustness work.

Catch is, and why I thought it is justified to do this it in this patch, is because before completion was in the entity so whats happening with entity->rq->sched after entity is stopped was irrelevant. With this patch is it relevant on paper so I considered it prudent to express that in the code as well as the comment.

Acceptable or unacceptable?

A dedicated patch probably should then be concerned about the 'stopped'
bool's synchronization in general and should investigate the lockless
check in drm_sched_entity_is_idle(), too. The memory barrier there
seems only concerned about the list…



+
        sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
        rq = sched ? &sched->rq : NULL;
        if (rq != entity->rq) {
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h 
b/drivers/gpu/drm/scheduler/sched_internal.h
index 13ecb771d7a2..80dece3be415 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -35,12 +35,22 @@ bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
                         struct drm_sched_entity *entity);
   void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
+/**
+ * drm_sched_flush_run_work - flush the run-job work

In v1, you'd probably want to document what this function typically
will be used for :)

Well its in the scheduler _internal_ header and I am not sure what to
write which will add real value.

"Only used to make sure a stopped entity is not in use by the scheduler
workers."

I think the magic word that provides value to newbies trying to get
familiar with our complex code base is "synchronization".

"A synchronization helper used to make sure that no scheduler worker
does access this entity anymore."

Wilco.

Regards,

Tvrtko


Reply via email to