+Cc Lyude, Danilo

On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
> Exceeding the recommended maximum timeout should be noted in logs and
> crash dumps.
> 
> Signed-off-by: Christian König <[email protected]>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1d4f1b822e7b..88e24e140def 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1318,12 +1318,22 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>       sched->ops = args->ops;
>       sched->credit_limit = args->credit_limit;
>       sched->name = args->name;
> -     sched->timeout = args->timeout;
>       sched->hang_limit = args->hang_limit;
>       sched->timeout_wq = args->timeout_wq ? args->timeout_wq : system_percpu_wq;
>       sched->score = args->score ? args->score : &sched->_score;
>       sched->dev = args->dev;
>  
> +     sched->timeout = args->timeout;
> +     if (sched->timeout > DMA_FENCE_MAX_REASONABLE_TIMEOUT) {
> +             dev_warn(sched->dev, "Timeout %ld exceeds the maximum recommended one!\n",
> +                      sched->timeout);
> +             /*
> +              * Make sure that exceeding the recommendation is noted in
> +              * logs and crash dumps.
> +              */
> +             add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> +     }
> +


I have to NACK this in its current form. It would cause a bunch of
drivers to fire warnings even though there has never been anything
wrong with them:

https://elixir.bootlin.com/linux/v6.18-rc6/source/drivers/gpu/drm/nouveau/nouveau_sched.c#L412
https://elixir.bootlin.com/linux/v6.18-rc6/source/drivers/gpu/drm/lima/lima_sched.c#L519

I guess there are more.

Nouveau's current timeout is an astonishing 10 seconds, and AFAIK there
has never been a problem with that. If you want to declare this
behavior invalid, you need to discuss that with the Nouveau maintainers
first.
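
To illustrate the concern, here is a rough userspace sketch (not kernel
code). The 2-second limit and the helper name are made up for the
example; the actual value of DMA_FENCE_MAX_REASONABLE_TIMEOUT is not
quoted in this mail, so treat the number purely as an assumption.

```c
#include <stdbool.h>

/* Hypothetical limit in milliseconds; NOT the real value from the series. */
#define EXAMPLE_MAX_REASONABLE_TIMEOUT_MS 2000L

/*
 * Hypothetical helper mirroring the comparison in the patch:
 * returns true if a scheduler timeout would trip the proposed check.
 */
static bool example_timeout_would_warn(long timeout_ms)
{
	return timeout_ms > EXAMPLE_MAX_REASONABLE_TIMEOUT_MS;
}
```

With any limit below 10 seconds, a 10-second timeout like Nouveau's
makes the helper return true, i.e. the driver would warn and taint the
kernel on every scheduler init.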

It also isn't clear to me why dma_fence should define a timeout rule at
all. I like to think that "must be signalled within reasonable time"
is as precise as it gets. As the drivers demonstrate, there is just
no objectively correct definition of "reasonable".

BTW, your series doesn't make clear to me why you only touch very few
components. There are many more users of dma_fence than just vgem and
sched. What about the others?


P.
