On Fri, Dec 05, 2025 at 10:19:32AM +0100, Christian König wrote:
> On 12/4/25 17:04, Alex Deucher wrote:
> > On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <[email protected]> wrote:
> >>
> >> +Cc Alex, Christian, Danilo
> >>
> >>
> >> On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> >>> Stop open coding the pending job list walk in drivers. Add a pending
> >>> job list iterator which safely walks the DRM scheduler's pending list,
> >>> asserting that the scheduler is stopped.
> >>>
> >>> v2:
> >>>  - Fix checkpatch (CI)
> >>> v3:
> >>>  - Drop locked version (Christian)
> >>> v4:
> >>>  - Reorder patch (Niranjana)
> >>
> >> Same with the changelog.
> >>
> >>>
> >>> Signed-off-by: Matthew Brost <[email protected]>
> >>> Reviewed-by: Niranjana Vishwanathapura <[email protected]>
> >>> ---
> >>>  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 50 insertions(+)
> >>>
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 385bf34e76fe..9d228513d06c 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -730,4 +730,54 @@ static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> >>>               dma_fence_is_signaled(&s_fence->finished);
> >>>  }
> >>>
> >>> +/**
> >>> + * struct drm_sched_pending_job_iter - DRM scheduler pending job iterator state
> >>> + * @sched: DRM scheduler associated with pending job iterator
> >>> + */
> >>> +struct drm_sched_pending_job_iter {
> >>> +     struct drm_gpu_scheduler *sched;
> >>> +};
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline struct drm_sched_pending_job_iter
> >>> +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> >>> +{
> >>> +     struct drm_sched_pending_job_iter iter = {
> >>> +             .sched = sched,
> >>> +     };
> >>> +
> >>> +     WARN_ON(!drm_sched_is_stopped(sched));
> >>> +     return iter;
> >>> +}
> >>> +
> >>> +/* Drivers should never call this directly */
> >>> +static inline void
> >>> +__drm_sched_pending_job_iter_end(const struct drm_sched_pending_job_iter iter)
> >>> +{
> >>> +     WARN_ON(!drm_sched_is_stopped(iter.sched));
> >>> +}
> >>> +
> >>> +DEFINE_CLASS(drm_sched_pending_job_iter, struct drm_sched_pending_job_iter,
> >>> +          __drm_sched_pending_job_iter_end(_T),
> >>> +          __drm_sched_pending_job_iter_begin(__sched),
> >>> +          struct drm_gpu_scheduler *__sched);
> >>> +static inline void *
> >>> +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t *_T)
> >>> +{ return _T; }
> >>> +#define class_drm_sched_pending_job_iter_is_conditional false
> >>> +
> >>> +/**
> >>> + * drm_sched_for_each_pending_job() - Iterator for each pending job in scheduler
> >>> + * @__job: Current pending job being iterated over
> >>> + * @__sched: DRM scheduler to iterate over pending jobs
> >>> + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no filter
> >>> + *
> >>> + * Iterator for each pending job in scheduler, filtering on an entity, and
> >>> + * enforcing scheduler is fully stopped
> >>> + */
> >>> +#define drm_sched_for_each_pending_job(__job, __sched, __entity)             \
> >>> +     scoped_guard(drm_sched_pending_job_iter, (__sched))                   \
> >>> +             list_for_each_entry((__job), &(__sched)->pending_list, list)  \
> >>> +                     for_each_if(!(__entity) || (__job)->entity == (__entity))
> >>> +
> >>>  #endif
> >>
> >>
> >> See my comments in the first patch. The documentation doesn't mention at
> >> all why this new functionality exists or when and why users would be
> >> expected to use it.
> >>
> >> As far as I remember from XDC, both AMD and Intel overwrite a timed-out
> >> job's buffer data in the rings on GPU reset. To do so, the driver needs
> >> the timed-out job (passed through the timedout_job() callback) and then
> >> needs all the pending non-broken jobs.
> >>
> >> AFAICS your patch provides a generic iterator over the entire
> >> pending_list. How is a driver then supposed to determine which are the
> >> non-broken jobs (just asking, but that needs to be documented)?
> >>
> >> Could it make sense to use a different iterator which only returns jobs
> >> not belonging to the same context as the timed-out one?
> >>
> >> Those are important questions that need to be addressed before merging
> >> that.
> >>
> >> And if this works canonically (i.e., for basically everyone), it needs
> >> to be documented in drm_sched_resubmit_jobs() that this iterator is now
> >> the canonical way of handling timeouts.
> >>
> >> Moreover, btw, just yesterday I added an entry to the DRM todo list
> >> which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> >> would have to be removed, too.
> >>
> >>
> >> @AMD: Would the code Matthew provides work for you? Please give your
> >> input. This is very important common infrastructure.
> > 
> > I don't think drm_sched_resubmit_jobs() can work for us without major
> > rework.  For our kernel queues, we have a single queue on which jobs
> > for different clients are scheduled.  When we reset the queue, we lose
> > all jobs on the queue and have to re-emit the non-guilty ones.  We do
> > this at the ring level, i.e., we save the packets directly from the
> > ring and then re-emit the packets for the non-guilty contexts to the
> > freshly reset ring.  This avoids running run_job() again which would
> > issue new fences and race with memory management, etc.
> > 
> > I think the following would be workable:
> > 1. driver job_timedout() callback flags the job as bad, resets the bad
> > queue, and calls drm_sched_resubmit_jobs()
> > 2. drm_sched_resubmit_jobs() walks the pending list and calls
> > run_job() for every job
> 
> Calling run_job() multiple times was one of the worst ideas I have ever seen.
> 

I'm not sure who this is directed at, but maybe dial back the intensity a
bit here. I really doubt this is one of the worst ideas you've ever
seen.

> The problem here is that you need a transactional approach to the internal 
> driver state which is modified by ->run_job().
> 
> So for example if you have:
> ->run_job(A)
> ->run_job(B)
> ->run_job(C)
> 
> And after a reset you find that you need to re-submit only job B and A & C 
> are filtered then that means that your driver state needs to get back before 
> running job A.
> 

This scenario isn't possible in Xe due to the 1:1 relationship between
the scheduler and the entity. Jobs execute serially on a single ring, so
if B needs to be resubmitted, so does C. I'm not sure how AMDGPU works,
but if this scenario can occur, it seems like either a significant
problem or a misuse of the scheduler, as AFAIK jobs submitted to a
scheduler instance should signal in order.

> > 2. driver run_job() callback looks to see if we already ran this job
> > and uses the original fence rather than allocating a new one
> 
> Nope, the problem is *all* drivers *must* use the original fence. Otherwise 
> you always run into trouble.
> 

Isn't that what Alex is saying, i.e., always use the original fence? I
fully agree with this; Xe does the same. That's why
drm_sched_resubmit_jobs() is confusing: as coded, it assumes run_job()
can return a different fence than the original invocation did. This is
part of the reason I didn't use this function in Xe; it looked scary.
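A toy sketch of the rule as I understand it (userspace mock with
hypothetical types, not real driver code): the first run_job() allocates
the hardware fence, and any re-run after a reset must hand back the very
same object so existing waiters stay valid:

```c
#include <stdlib.h>

/* Hypothetical stand-ins -- not the real DRM fence/job types. */
struct fence {
	int refcount;
};

struct job {
	struct fence *hw_fence; /* fence from the first run */
};

/*
 * Mock run_job(): allocate the fence exactly once and cache it on the
 * job; a resubmission returns the cached fence instead of a new one.
 */
static struct fence *run_job(struct job *job)
{
	if (!job->hw_fence) {
		job->hw_fence = calloc(1, sizeof(*job->hw_fence));
		job->hw_fence->refcount = 1;
	}
	return job->hw_fence;
}
```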

> We should not promote a driver interface which makes it extremely easy to 
> shoot down the whole system.
> 
> > 3. driver run_job() callback checks to see if the job is guilty or
> > from the same context and if so, sets an error on the fences and
> > submits only the fence packet to the queue so that any follow up jobs
> > will properly synchronize if they need to wait on the fence from the
> > bad job.
> > 4. driver run_job() callback will submit the full packet stream for
> > non-guilty contexts
> > 
> > I guess we could use the iterator and implement that logic in the
> > driver directly rather than using drm_sched_resubmit_jobs().
> 
> Yeah, exactly that's the way to go.
> 

I think we're in agreement that this patch can be rebased, updated to
address the comments here, and then merged?
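For what it's worth, here is roughly the driver-side logic I'd expect
the iterator's entity filter to enable, mocked up in plain userspace C
(hypothetical types, not the real drm_sched_for_each_pending_job() API):

```c
#include <stddef.h>

/* Hypothetical stand-ins for scheduler entity and pending job. */
struct entity {
	int id;
};

struct job {
	struct entity *entity;
	int re_emitted;
};

/*
 * Mirrors what a driver could do in its timeout handler with the
 * pending-job iterator: walk the pending list and re-emit only the
 * jobs that do not belong to the guilty entity.
 */
static void re_emit_innocent(struct job *jobs, size_t n,
			     struct entity *guilty)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (jobs[i].entity != guilty)
			jobs[i].re_emitted = 1;
}
```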

Matt

> Christian.
> 
> > 
> > Alex
> 
