On 22/10/2025 15:03, Danilo Krummrich wrote:
On Wed Oct 22, 2025 at 3:50 PM CEST, Tvrtko Ursulin wrote:
Yes, for the case when entity joins the run-queue it can be the same
entity which is now the head of the queue, or it can be a different one.
Depends on the insertion position.

But for the case where entity is leaving the run queue it is always a
different entity and therefore a lock inversion. We have essentially this:

lock entity
lock rq
remove entity from the rq
rq->prio = rq->head_entity->prio // different entity, unlocked read
unlock rq
unlock entity

This sounds like it repeates the unclear locking situation that is also
documented for struct drm_sched_rq:

         * FIXME: Locking is very unclear for this. Writers are protected by
         * @lock, but readers are generally lockless and seem to just race with
         * not even a READ_ONCE.

This sounds pretty suspicious to me and I think it indicates a more fundamental
design issue that you now end up working around now.

I'm afraid it is not nearly the same. Guarantee that entity->rq is stable is a multi-step one which depends on the job queue being non empty and the last submitted job not being signalled. That side even includes a smp_rmb() in drm_sched_entity_select_rq(). Code which does the suspicious unlocked entity->rq access therefore claims to be certain one or both of those conditions must be true.

What I am doing here is way, way simpler and IMO should not controversial. It is well defined that entities can only enter and exit the run queue with the rq->lock held. Which the code path holds, and the functions asserts for. So a lockless read of an integer is nowhere near the complexities of the FIXME you quote.

I'd like to dig in a bit more, but unfortunately it's very unlikely I will have
the time to do this until after LPC.

Should I interpret this as putting a blocker on the series until effectively 2026?

Regards,

Tvrtko

Reply via email to