On Fri, 2026-07-03 at 12:27 +0100, Tvrtko Ursulin wrote:
>
> On 01/07/2026 09:59, Philipp Stanner wrote:
> > The entity->last_scheduled field has always been set and read with
> > special RCU functions in addition to memory barriers.
> >
> > This was added in
> >
> > commit 70102d77ff22 ("drm/scheduler: add drm_sched_entity_error and use rcu
> > for last_scheduled")
> >
> > however, no proper justification for that mechanism was provided. There
> > seems to be no obvious reason, since the entity lock is available and
> > taken at all places that evaluate the last_scheduled field. The only
> > exception is drm_sched_entity_error(), which is not performance critical
> > in any way.
> >
> > Improve robustness, readability and maintainability by replacing RCU and
> > barriers with the lock.
>
> First thing, and regardless of other strands of discussion, I think it
> should be squashed with 3/5 instead of that one undoing the introduction
> of lock-unlock-lock-unlock.
I agree that there should not be a do-undo pattern, but I don't want to
squash that, it's quite a distinctive action. One patch adds locks, the
other moves them.
But what I can do is move that patch before №1 here so that it becomes
understandable as a preparational commit.
>
> For what the main topic is concerned, I really like the removal of all
> the rcu_dereference_check(, true) lines and the memory barriers.
>
> But I also think the commit message should explain better what code
> paths are now taking an extra lock - under which circumstances is the
> lock now taken for all scheduler users, and which amdgpu paths use
> drm_sched_entity_error() a lot so could be affected. I doubt it creates
> a measurable performance impact but it needs to be explained.
I think it can detail which functions will now be locked; but
mentioning the users would be overkill and is uncommon for API reworks.
>
> I am also happy to give it a spin on the Steam Deck to see if I can
> observe anything.
Could be interesting.
>
> > Signed-off-by: Philipp Stanner <[email protected]>
> > ---
> > drivers/gpu/drm/scheduler/sched_entity.c | 50 ++++++++++--------------
> > include/drm/gpu_scheduler.h | 9 ++---
> > 2 files changed, 25 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > b/drivers/gpu/drm/scheduler/sched_entity.c
> > index c51101ec70c1..91aec20611ad 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -135,7 +135,6 @@ int drm_sched_entity_init(struct drm_sched_entity
> > *entity,
> > entity->num_sched_list = num_sched_list;
> > entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
> > entity->rq = &sched_list[0]->rq;
> > - RCU_INIT_POINTER(entity->last_scheduled, NULL);
> > RB_CLEAR_NODE(&entity->rb_tree_node);
> > init_completion(&entity->entity_idle);
> >
> > @@ -201,10 +200,10 @@ int drm_sched_entity_error(struct drm_sched_entity
> > *entity)
> > struct dma_fence *fence;
> > int r;
> >
> > - rcu_read_lock();
> > - fence = rcu_dereference(entity->last_scheduled);
> > + spin_lock(&entity->lock);
> > + fence = entity->last_scheduled;
> > r = fence ? fence->error : 0;
> > - rcu_read_unlock();
> > + spin_unlock(&entity->lock);
> >
> > return r;
> > }
> > @@ -287,9 +286,10 @@ void drm_sched_entity_kill(struct drm_sched_entity
> > *entity)
> > /* Make sure this entity is not used by the scheduler at the moment */
> > wait_for_completion(&entity->entity_idle);
> >
> > - /* The entity is guaranteed to not be used by the scheduler */
> > - prev = rcu_dereference_check(entity->last_scheduled, true);
> > + spin_lock(&entity->lock);
> > + prev = entity->last_scheduled;
> > dma_fence_get(prev);
> > + spin_unlock(&entity->lock);
> > while ((job = drm_sched_entity_queue_pop(entity))) {
> > struct drm_sched_fence *s_fence = job->s_fence;
> >
> > @@ -381,8 +381,7 @@ void drm_sched_entity_fini(struct drm_sched_entity
> > *entity)
> > entity->dependency = NULL;
> > }
> >
> > - dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
> > - RCU_INIT_POINTER(entity->last_scheduled, NULL);
> > + dma_fence_put(entity->last_scheduled);
> > drm_sched_entity_stats_put(entity->stats);
> > }
> > EXPORT_SYMBOL(drm_sched_entity_fini);
> > @@ -507,6 +506,10 @@ drm_sched_job_dependency(struct drm_sched_job *job,
> >
> > struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity
> > *entity)
> > {
> > + /* Helper to avoid dropping the reference while the entity lock is held,
> > + * just to have some more robustness.
> > + */
>
> I don't get this comment. Neither the placement or the content.
It explains the purpose of the variable 'prev_last_scheduled', which
exists so that a reference does not drop under lock protection.
P.