On 23/06/2026 15:48, Philipp Stanner wrote:
On Tue, 2026-06-23 at 15:05 +0100, Tvrtko Ursulin wrote:
There we have this sequence:
spin_lock(&entity->lock);
entity->stopped = true;
sched = drm_sched_rq_remove_entity(entity->rq, entity);
spin_unlock(&entity->lock);
if (sched)
drm_sched_flush_run_work(sched);
That is, without that check, in theory, an evil driver could race
drm_sched_entity_select_rq() (via drm_sched_job_arm()) and
drm_sched_entity_kill(). I am not sure if any driver can actually do
that at the moment but it felt sensible to express it in code.
The devil is in the "at the moment".
AFAICS and as your explanation sounds, this is an existing problem that
is unrelated to the Flush RFC. So I suppose that this should be a
separate patch (independent from this one) which precisely focusses on
this robustness work.
Catch is, and why I thought it is justified to do this it in this patch,
is because before completion was in the entity so whats happening with
entity->rq->sched after entity is stopped was irrelevant. With this
patch is it relevant on paper so I considered it prudent to express that
in the code as well as the comment.
I'm not sure if I can fully follow, but it seems that the stopped
boolean + spinlock now protects the scheduler pointer?
IDK, I don't feel comfortable right now.. New docu says
drm_sched_rq_add_entity(struct drm_sched_entity *entity)
* @entity: scheduler entity
*
* Removes a scheduler entity from the run queue.
+ *
+ * Return: DRM scheduler selected to handle this entity or NULL if entity has
+ * already been removed.
*/
Isn't a removed entity by definition stopped since drm_sched can't
access it anymore? Or do we have race there, too?
If so, then stopped and entity->rq->sched == NULL represent the same
state. IOW, the locked bool must be redundant with something.
Right?
No, they are separate things. Removed entity is a normal transitional
state that happens at runtime. Before any jobs have yet been submitted,
or when drm_sched_entity_select_rq() changes the selected scheduler it
removes the entity from the old run queue.
Regards,
Tvrtko
Acceptable or unacceptable?
In any case such changes should, wherever possible, be a separate
patch. For the fact alone that the distinct commit message and
especially the position / order of the patch within the series helps
the reviewer greatly in understanding what that particular change
enables and thus why it's necessary.
P.