Hi Daniel,

> On 21.08.2024 11:37, Daniel Almeida wrote:
> It can be advantageous for userspace to have access to successful jobs.

While it's true that having additional jobs as part of the same devcoredump file
could perhaps provide further information as to why a later job failed, I see a
few snags with this approach:

- The time between the debugfs knob being triggered (and thus successful jobs
starting to be dumped) and a later job failing might be very long, so the
preceding jobs may not provide much context.
- Besides being mostly interested in the immediately preceding jobs, I think
we'd want these to belong to the same scheduling group and VM as the failing
job, but this approach will dump them all consecutively, even if they belong
to a different open DRM file.
- In my experience writing a similar feature for the Panfrost driver, it's best
to wait until users of the driver run into specific bugs before coming up with
the debug features that would help them, rather than trying to guess in
advance, because there's a risk those features will never be used and will
just add cruft to the codebase.

Other than that, the preceding patches look absolutely gorgeous to me, so I
think it's best if you resubmit them, and maybe keep patches 3-5 on hold until
we run into a bug scenario where they might prove useful.

Cheers,
Adrian

> Allow this as an opt-in through a debugfs file.
> 
> Note that the devcoredump infrastructure will discard new dumps if a
> previous dump hasn't been read. A future patch will add support for
> multi-job dumps which will work around this limitation.
> 
> Signed-off-by: Daniel Almeida <[email protected]>
> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index afd644c7d9f1..ea2696c1075a 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -10,6 +10,7 @@
>  
>  #include <linux/build_bug.h>
>  #include <linux/clk.h>
> +#include <linux/debugfs.h>
>  #include <linux/delay.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/dma-resv.h>
> @@ -317,6 +318,9 @@ struct panthor_scheduler {
>                */
>               struct list_head stopped_groups;
>       } reset;
> +
> +     /** @dump_successful_jobs: whether to dump successful jobs through coredumpv */
> +     bool dump_successful_jobs;
>  };
>  
>  /**
> @@ -2946,6 +2950,16 @@ queue_run_job(struct drm_sched_job *sched_job)
>       queue->iface.input->extract = queue->iface.output->extract;
>       queue->iface.input->insert = job->ringbuf.end;
>  
> +     if (sched->dump_successful_jobs) {
> +             struct panthor_core_dump_args core_dump_args = {
> +                     .ptdev = ptdev,
> +                     .group_vm = job->group->vm,
> +                     .group = job->group,
> +             };
> +
> +             panthor_core_dump(&core_dump_args);
> +     }
> +
>       if (group->csg_id < 0) {
>               /* If the queue is blocked, we want to keep the timeout running, so we
>                * can detect unbounded waits and kill the group when that happens.
> @@ -3609,5 +3623,8 @@ void panthor_sched_debugfs_init(struct drm_minor *minor)
>       struct panthor_device *ptdev =
>               container_of(minor->dev, struct panthor_device, base);
>       struct panthor_scheduler *sched = ptdev->scheduler;
> +
> +     debugfs_create_bool("dump_successful_jobs", 0644, minor->debugfs_root,
> +                         &sched->dump_successful_jobs);
>  }
>  #endif /* CONFIG_DEBUG_FS */
> -- 
> 2.45.2
