On 12/11/2025 12:47, Boris Brezillon wrote: > On Fri, 7 Nov 2025 16:40:53 +0000 > Steven Price <[email protected]> wrote: > >> On 31/10/2025 15:48, Boris Brezillon wrote: >>> A group can become runnable even after reset.in_progress has >>> been set to true and panthor_sched_suspend() has been called, >>> because the drm_sched queues are still running at that point, >>> and ::run_job() might call group_schedule_locked() which moves >>> the group to the runnable list. And that's fine, because we're >>> moving those groups to the stopped list anyway when we call >>> panthor_group_stop(), so just drop the misleading WARN_ON(). >> >> If we've got another thread mutating the runnable list between >> panthor_sched_suspend() and list_for_each_entry_safe(), doesn't that >> make the list iterator unsafe? (_safe only protects against deleting the >> current item, not against concurrent access). > > I'm not too sure actually. There's an > atomic_read(&sched->reset.in_progress) to check if we're about to reset > in group_schedule_locked() and cancel the insertion into the runnable > list in that case, meaning we're sure nothing new will be inserted after > we've set the in_progress=true in panthor_sched_pre_reset().
I was mostly going on your commit message: > A group can become runnable even after reset.in_progress has > been set to true and panthor_sched_suspend() has been called if that is indeed happening then we have a problem (and removing the WARN_ON is just papering over it). I haven't actually followed through the logic. >> >> It feels to me like we should be holding the sched mutex - at least >> while iterating. I agree the WARN_ON is unnecessary, and will need >> removing if we simply guard the iteration - the alternative is to >> recolour panthor_sched_suspend() to assume the lock is held (and take >> the lock in panthor_sched_pre_reset), but I suspect that's a more ugly >> change. > > I'd rather ensure that nothing new is inserted in the runnable/idle > lists after sched->reset.in_progress is set to true. Note that > sched->reset.in_progress is set to true with the sched lock held, > meaning any path modifying the sched lists (must be done with the sched > lock held) should complete before we set this to true. As long as those > paths also skip the list insertion, or, for things happening in a work > context (thinking of the tick work here), as long as the work is not > rescheduled until we get a chance to disable this work, we should be > good, no? Yes that design can work. But atomics can be difficult to reason about, so unless there's a good reason I think we'd generally be better sticking with (simple) locks on the slow paths, then we get the benefits of lockdep etc checking we haven't messed up. Thanks, Steve
