On Fri, Dec 05, 2025 at 10:19:32AM +0100, Christian König wrote:
> On 12/4/25 17:04, Alex Deucher wrote:
> > On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <[email protected]> wrote:
> >>
> >> +Cc Alex, Christian, Danilo
> >>
> >>
> >> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> >>> Stop open coding the pending job list walk in drivers. Add a pending
> >>> job list iterator which safely walks the DRM scheduler list, asserting
> >>> the DRM scheduler is stopped.
> >>>
> >>> v2:
> >>>  - Fix checkpatch (CI)
> >>> v3:
> >>>  - Drop locked version (Christian)
> >>> v4:
> >>>  - Reorder patch (Niranjana)
> >>
> >> Same with the changelog.
> >>
> >>>
> >>> Signed-off-by: Matthew Brost <[email protected]>
> >>> Reviewed-by: Niranjana Vishwanathapura <[email protected]>
> >>> ---
> >>>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 50 insertions(+)
> >>>
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 385bf34e76fe..9d228513d06c 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> >>>  		dma_fence_is_signaled(&s_fence->finished);
> >>>  }
> >>>
> >>> +/**
> >>> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> >>> + * @sched: DRM scheduler associated with pending job iterator
> >>> + */
> >>> +struct drm_sched_pending_job_iter {
> >>> +	struct drm_gpu_scheduler *sched;
> >>> +};
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline struct drm_sched_pending_job_iter
> >>> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> >>> +{
> >>> +	struct drm_sched_pending_job_iter iter = {
> >>> +		.sched = sched,
> >>> +	};
> >>> +
> >>> +	WARN_ON(!drm_sched_is_stopped(sched));
> >>> +	return iter;
> >>> +}
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline void
> >>> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> >>> +{
> >>> +	WARN_ON(!drm_sched_is_stopped(iter.sched));
> >>> +}
> >>> +
> >>> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> >>> +	     __drm_sched_pending_job_iter_end(_T),
> >>> +	     __drm_sched_pending_job_iter_begin(__sched),
> >>> +	     struct drm_gpu_scheduler *__sched);
> >>> +static inline void *
> >>> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> >>> +{ return _T; }
> >>> +#define class_drm_sched_pending_job_iter_is_conditional false
> >>> +
> >>> +/**
> >>> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> >>> + * @__job: Current pending job being iterated over
> >>> + * @__sched: DRM scheduler to iterate over pending jobs
> >>> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> >>> + *
> >>> + * Iterator for each pending job in scheduler, filtering on an entity, and
> >>> + * enforcing scheduler is fully stopped
> >>> + */
> >>> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)	\
> >>> +	scoped_guard(drm_sched_pending_job_iter, (__sched))		\
> >>> +		list_for_each_entry((__job), &(__sched)->pending_list, list) \
> >>> +			for_each_if(!(__entity) || (__job)->entity == (__entity))
> >>> +
> >>>  #endif
> >>
> >>
> >> See my comments in the first patch. The documentation doesn't mention at
> >> all why this new functionality exists and when and why users would be
> >> expected to use it.
> >>
> >> As far as I remember from XDC, both AMD and Intel overwrite a timed-out
> >> job's buffer data in the rings on GPU reset. To do so, the driver needs
> >> the timed-out job (passed through the timedout_job() callback) and then
> >> needs all the pending non-broken jobs.
> >>
> >> AFAICS your patch provides a generic iterator over the entire
> >> pending_list. How is a driver then supposed to determine which are the
> >> non-broken jobs (just asking, but that needs to be documented)?
> >>
> >> Could it make sense to use a different iterator which only returns jobs
> >> not belonging to the same context as the timed-out one?
> >>
> >> Those are important questions that need to be addressed before merging
> >> that.
> >>
> >> And if this works canonically (i.e., for basically everyone), it needs
> >> to be documented in drm_sched_resubmit_jobs() that this iterator is now
> >> the canonical way of handling timeouts.
> >>
> >> Moreover, btw, just yesterday I added an entry to the DRM todo list
> >> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> >> would have to be removed, too.
> >>
> >>
> >> @AMD: Would the code Matthew provides work for you? Please give your
> >> input. This is very important common infrastructure.
> >
> > I don't think drm_sched_resubmit_jobs() can work for us without major
> > rework. For our kernel queues, we have a single queue on which jobs
> > for different clients are scheduled. When we reset the queue, we lose
> > all jobs on the queue and have to re-emit the non-guilty ones. We do
> > this at the ring level, i.e., we save the packets directly from the
> > ring and then re-emit the packets for the non-guilty contexts to the
> > freshly reset ring. This avoids running run_job() again, which would
> > issue new fences, race with memory management, etc.
> >
> > I think the following would be workable:
> > 1. driver job_timedout() callback flags the job as bad, resets the bad
> > queue, and calls drm_sched_resubmit_jobs()
> > 2. drm_sched_resubmit_jobs() walks the pending list and calls
> > run_job() for every job
>
> Calling run_job() multiple times was one of the worst ideas I have ever seen.
>

I'm not sure who this is directed at, but maybe dial back the intensity a
bit here. I really doubt this is one of the worst ideas you've ever seen.

> The problem here is that you need a transactional approach to the internal
> driver state which is modified by ->run_job().
>
> So for example if you have:
> ->run_job(A)
> ->run_job(B)
> ->run_job(C)
>
> And after a reset you find that you need to re-submit only job B, and A & C
> are filtered, then that means that your driver state needs to get back to
> before running job A.
>

This scenario isn't possible in Xe due to the 1:1 relationship between the
scheduler and the entity. Jobs execute serially on a single ring, so if B
needs to be resubmitted, so does C. I'm not sure how AMDGPU works, but it
seems like a significant problem if this scenario can occur, or the
scheduler is being misused, as AFAIK jobs submitted to a scheduler instance
should signal in order.

> > 2. driver run_job() callback looks to see if we already ran this job
> > and uses the original fence rather than allocating a new one
>
> Nope, the problem is *all* drivers *must* use the original fence. Otherwise
> you always run into trouble.
>

Isn't that what Alex is saying: always use the original fence? I fully
agree with this; Xe does the same. That's why drm_sched_resubmit_jobs() is
confusing, as the way it's coded assumes run_job() can return a different
fence than the original invocation of run_job(). This is part of the
reason I didn't use this function in Xe; it looked scary.

> We should not promote a driver interface which makes it extremely easy to
> shoot down the whole system.
>
> > 3. driver run_job() callback checks to see if the job is guilty or
> > from the same context and, if so, sets an error on the fences and
> > submits only the fence packet to the queue so that any follow-up jobs
> > will properly synchronize if they need to wait on the fence from the
> > bad job.
> > 4. driver run_job() callback will submit the full packet stream for
> > non-guilty contexts
> >
> > I guess we could use the iterator and implement that logic in the
> > driver directly rather than using drm_sched_resubmit_jobs().
>
> Yeah, exactly that's the way to go.
>

I think we're in agreement, then: this patch can be rebased, with any
comments here addressed, and then be merged?

Matt

> Christian.
>
> >
> > Alex
>
