Hi Christian,

On 11/07/25 12:20, Christian König wrote:
On 11.07.25 15:37, Philipp Stanner wrote:
On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:


On 08.07.25 15:25, Maíra Canal wrote:
When the DRM scheduler's timeout fires, it's possible that the GPU isn't
hung; instead, a job just took unusually long (longer than the timeout)
but is still running, so there is no reason to reset the hardware. This
can occur in two scenarios:

   1. The job is taking longer than the timeout, but the driver determined
      through a GPU-specific mechanism that the hardware is still making
      progress. Hence, the driver would like the scheduler to skip the
      timeout and treat the job as still pending from then onward. This
      happens in v3d, Etnaviv, and Xe.
   2. The timeout fires before the free-job worker has had a chance to
      run. Consequently, the scheduler calls `sched->ops->timedout_job()`
      for a job that has already finished and isn't actually timed out.

These two scenarios are problematic because the job was removed from
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
through `sched->ops->free_job()` - leading to a memory leak.

Yeah, that is unfortunately intentional.

To solve these problems, create a new `drm_gpu_sched_stat` value, called
DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
new status indicates that the job must be reinserted into
`sched->pending_list` and that the hardware / driver will still complete
it.
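
As an illustration, a driver-side timeout handler could look roughly
like this (a sketch only: the foo_* names are placeholders for
driver-specific code, and the reset path is assumed to still return the
pre-existing DRM_GPU_SCHED_STAT_NOMINAL):

/*
 * Hypothetical driver timeout handler; foo_gpu_is_making_progress()
 * and foo_reset_gpu() are made-up stand-ins for the GPU-specific
 * mechanisms that v3d, Etnaviv, and Xe each implement differently.
 */
static enum drm_gpu_sched_stat
foo_timedout_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = to_foo_job(sched_job);

	if (foo_gpu_is_making_progress(job)) {
		/*
		 * Not a real hang: skip the reset. The new status tells
		 * the scheduler to put the job back on
		 * sched->pending_list, so free_job() still runs once
		 * the job eventually completes.
		 */
		return DRM_GPU_SCHED_STAT_NO_HANG;
	}

	/* A genuine hang: reset the hardware as before. */
	foo_reset_gpu(job);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}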

Well, long story short: we have already tried this, and the whole
approach doesn't work correctly in all cases. See the git history around
how we used to destroy the jobs.

The basic problem is that you can always race between timing out and
signaling/destroying the job. This is the long-standing job lifetime
issue we have already discussed more than once.

The scheduler destroys the job through free_job().
I think we have agreed that, for now, the scheduler is the party
responsible for the job's lifetime.

That's what I strongly disagree with. The job is just a state bag
between the submission state and the scheduling state of a submission.

For the scheduler, control starts when the job is pushed into the entity
and ends when run_job() is called.

The real representation of the submission is the scheduler fence, and
that object has a perfectly defined lifetime, state, and error handling.
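
For reference, that object looks roughly like this (an abridged sketch
of struct drm_sched_fence from include/drm/gpu_scheduler.h; the exact
fields differ between kernel versions):

struct drm_sched_fence {
	/* Signaled when the scheduler picks the job to be run. */
	struct dma_fence		scheduled;
	/* Signaled when the job has completed (or errored out). */
	struct dma_fence		finished;
	/* The hardware fence returned by run_job(), once it exists. */
	struct dma_fence		*parent;
	struct drm_gpu_scheduler	*sched;
	/* Remaining fields (lock, owner, ...) omitted here. */
};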


If you want to fix this, I think the correct approach is to drop
tracking jobs in the scheduler altogether.

I don't see how this series introduces a problem?

The fact is that drivers are abusing the API by just firing jobs back
into the scheduler's job list. This series legalizes that abuse by
providing scheduler functionality for it.

IOW, the series improves the situation but does not add a *new* problem.
Even less so as drivers aren't forced to use the new status code, but
can continue letting job completion race with their timeout handlers.

Maybe yes, but I'm really not sure about it.

Take a look at the git history of job destruction: we already had
exactly that approach, removed it, and decided that leaking memory is at
least better than a use-after-free issue.


If the job was removed from the pending list at the beginning of the
timeout handling, and drm_sched_get_finished_job() only fetches jobs
from the pending list, how can we end up with a use-after-free issue?
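
For reference, the timeout path does roughly the following (an abridged
paraphrase of drm_sched_job_timedout() in
drivers/gpu/drm/scheduler/sched_main.c, not the exact upstream code;
details vary between kernel versions):

static void sched_timedout_sketch(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job) {
		/*
		 * Unlinking under the lock is the key point: once the
		 * job is off pending_list, drm_sched_get_finished_job()
		 * can no longer fetch and free it concurrently with the
		 * driver's timeout handler.
		 */
		list_del_init(&job->list);
	}
	spin_unlock(&sched->job_list_lock);

	if (job)
		job->sched->ops->timedout_job(job);
}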

Best Regards,
- Maíra
