+Cc Christian, Alex, Danilo

On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> Add helpers to see if scheduler is stopped and a jobs signaled state.
> Expected to be used driver side on recovery and debug flows.

First, thanks for working on this.

This is a big and significant change because it moves towards ending
the 10-year practice of accessing internal locks etc. – I think this
should have a long(er) and detailed commit message aka "In the past
drivers used to … this must end because … to do so we need to provide
those new functions: …"

> 
> v4:
>  - Reorder patch to first in series (Niranjana)
>  - Also check parent fence for signaling (Niranjana)

"We" mostly agreed of not adding changelogs to commit messages anymore
and either have them in the cover letter or in the patche's comment
section below ---
The commit changelog comments are not canonical in the kernel and don't
provide any value IMO.

> 
> Signed-off-by: Matthew Brost <[email protected]>
> Reviewed-by: Niranjana Vishwanathapura <[email protected]>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c |  4 ++--
>  include/drm/gpu_scheduler.h            | 32 ++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 1d4f1b822e7b..cf40c18ab433 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -344,7 +344,7 @@ drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler 
> *sched,
>   */
>  static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
> -     if (!READ_ONCE(sched->pause_submit))
> +     if (!drm_sched_is_stopped(sched))
>               queue_work(sched->submit_wq, &sched->work_run_job);
>  }
>  
> @@ -354,7 +354,7 @@ static void drm_sched_run_job_queue(struct 
> drm_gpu_scheduler *sched)
>   */
>  static void drm_sched_run_free_queue(struct drm_gpu_scheduler *sched)
>  {
> -     if (!READ_ONCE(sched->pause_submit))
> +     if (!drm_sched_is_stopped(sched))
>               queue_work(sched->submit_wq, &sched->work_free_job);
>  }
>  
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index fb88301b3c45..385bf34e76fe 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -698,4 +698,36 @@ void drm_sched_entity_modify_sched(struct 
> drm_sched_entity *entity,
>                                  struct drm_gpu_scheduler **sched_list,
>                                  unsigned int num_sched_list);
>  
> +/* Inlines */

Surplus comment, everyone immediately sees by the keyword that the
functions are inline.

But why do you want to provide them here instead of in sched_main.c in
the first place?


> +
> +/**
> + * drm_sched_is_stopped() - DRM is stopped

Well no, I doubt the entire DRM subsystem is stopped ;)

"Checks whether drm_sched is stopped"

> + * @sched: DRM scheduler
> + *
> + * Return: True if sched is stopped, False otherwise
> + */
> +static inline bool drm_sched_is_stopped(struct drm_gpu_scheduler *sched)
> +{
> +     return READ_ONCE(sched->pause_submit);

I am by the way suspecting since a long time

> +}
> +
> +/**
> + * drm_sched_job_is_signaled() - DRM scheduler job is signaled
> + * @job: DRM scheduler job
> + *
> + * Determine if DRM scheduler job is signaled. DRM scheduler should be 
> stopped
> + * to obtain a stable snapshot of state. Both parent fence (hardware fence) 
> and
> + * finished fence (software fence) are check to determine signaling state.

s/check/checked

I can roughly understand why you need the start/stop checkers for your
list iterator, but what is this function's purpose? The commit message
should explain that.

Do you need them in Xe? Do all drivers need them?

I think it's very cool that you provide this series and are working on
all that, but at XDC I think the important point was that we determined
that AMD and Intel basically do the same trick for GPU resets.

So our desire was not only to prevent folks from accessing the
scheduler's internals, but, ideally, also provide a well documented,
centralized and canonical mechanisms to do GPU resets.

So I think this drm/sched code must be discussed with AMD and we should
see whether it would be sufficient for them, too. And if yes, we need
to properly document that new way of GPU resets and tell users what
those functions are for. The docstrings so far just highlight that
those functions exist and how they are used, but not *why* they exist.

> + *
> + * Return: True if job is signaled, False otherwise

True and False should be lower case I think. At least I've never seen
them upper case in docstrings so far?


P.

> + */
> +static inline bool drm_sched_job_is_signaled(struct drm_sched_job *job)
> +{
> +     struct drm_sched_fence *s_fence = job->s_fence;
> +
> +     WARN_ON(!drm_sched_is_stopped(job->sched));
> +     return (s_fence->parent && dma_fence_is_signaled(s_fence->parent)) ||
> +             dma_fence_is_signaled(&s_fence->finished);
> +}
> +
>  #endif

Reply via email to