On Mon, Oct 13, 2025 at 06:01:49PM +0100, Tvrtko Ursulin wrote:
>
> On 13/10/2025 15:39, Philipp Stanner wrote:
> > On Fri, 2025-10-03 at 10:26 +0100, Tvrtko Ursulin wrote:
> > > Drm_sched_job_add_dependency() consumes the fence reference both on
> >
> > s/D/d
>
> I cannot find the source right now but I am 99% sure someone pulled it on me
> in the past and educated me sentences should always start with a capital
> letter, even when it is not a proper word. Because in the past I was always
> doing the same as you suggest. I don't mind TBH.
Drm_sched_job_add_dependency breaks grep / searches, so I'm with Philipp s/D/d
Matt
> > > success and failure, so in the latter case the dma_fence_put() on the
> > > error path (xarray failed to expand) is a double free.
> > >
> > > Interestingly this bug appears to have been present ever since
> > > ebd5f74255b9 ("drm/sched: Add dependency tracking"), since the code back
> > > then looked like this:
> > >
> > > drm_sched_job_add_implicit_dependencies():
> > > ...
> > > for (i = 0; i < fence_count; i++) {
> > > ret = drm_sched_job_add_dependency(job, fences[i]);
> > > if (ret)
> > > break;
> > > }
> > >
> > > for (; i < fence_count; i++)
> > > dma_fence_put(fences[i]);
> > >
> > > Which means for the failing 'i' the dma_fence_put was already a double
> > > free. Possibly there were no users at that time, or the test cases were
> > > insufficient to hit it.
> > >
> > > The bug was then only noticed and fixed after
> > > 9c2ba265352a ("drm/scheduler: use new iterator in
> > > drm_sched_job_add_implicit_dependencies v2")
> > > landed, with its fixup of
> > > 4eaf02d6076c ("drm/scheduler: fix
> > > drm_sched_job_add_implicit_dependencies").
> > >
> > > At that point it was a slightly different flavour of a double free, which
> > > 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies
> > > harder")
> > > noticed and attempted to fix.
> > >
> > > But it only moved the double free from happening inside the
> > > drm_sched_job_add_dependency(), when releasing the reference not yet
> > > obtained, to the caller, when releasing the reference already released by
> > > the former in the failure case.
> >
> > That's certainly interesting, but is there a specific reason why you
> > include all of that?
>
> Yes, because bug was attempted to be fixed two times already. So it is IMO
> worth having a write up on the history.
> > The code is as is, and AFAICS it's just a bug stemming from original
> > bugs present and then refactorings happening.
> >
> > I would at least remove the old 'implicit_dependencies' function from
> > the commit message. It's just confusing and makes one look for that in
> > the current code or patch.It does say "back then" and it references the
> > commit so shouldn't be
> confusing.
>
> The discussion is long for an additional reason that Fixes: targets not the
> original commit which did not get this right, but the last change which
> tried to fix it but did not. It sounded more logical to continue the chain
> on fixes but dunno. Could replace it with the original one just as well.
> > > As such it is not easy to identify the right target for the fixes tag so
> > > lets keep it simple and just continue the chain.
> > >
> > > We also drop the misleading comment about additional reference, since it
> > > is not additional but the only one from the point of view of dependency
> > > tracking.
> >
> >
> > IMO that comment is nonsense. It's useless, too, because I can *see*
> > that a reference is being taken there, but not *why*.
> >
> > Argh, these comments. See also my commit 72ebc18b34993
> >
> >
> > Anyways. Removing it is fine, but adding a better comment is better.
> > See below.
> >
> > >
> > > Signed-off-by: Tvrtko Ursulin <[email protected]>
> > > Fixes: 963d0b356935 ("drm/scheduler: fix
> > > drm_sched_job_add_implicit_dependencies harder")
> > > Reported-by: Dan Carpenter <[email protected]>
> >
> > Is there an error report that could be included here with a Closes:
> > tag?No, it did not come from any such system. Could perhaps put a link
> > to
> the report as Reference:
> https://lore.kernel.org/dri-devel/[email protected]/
>
> ?
>
> > > Cc: Christian König <[email protected]>
> > > Cc: Rob Clark <[email protected]>
> > > Cc: Daniel Vetter <[email protected]>
> > > Cc: Matthew Brost <[email protected]>
> > > Cc: Danilo Krummrich <[email protected]>
> > > Cc: Philipp Stanner <[email protected]>
> > > Cc: "Christian König" <[email protected]>
> > > Cc: [email protected]
> > > Cc: <[email protected]> # v5.16+
> > > ---
> > > drivers/gpu/drm/scheduler/sched_main.c | 14 +++++---------
> > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 46119aacb809..aff34240f230 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -960,20 +960,16 @@ int drm_sched_job_add_resv_dependencies(struct
> > > drm_sched_job *job,
> > > {
> > > struct dma_resv_iter cursor;
> > > struct dma_fence *fence;
> > > - int ret;
> > > + int ret = 0;
> > > dma_resv_assert_held(resv);
> > > dma_resv_for_each_fence(&cursor, resv, usage, fence) {
> > > - /* Make sure to grab an additional ref on the added fence */
> > > - dma_fence_get(fence);
> > > - ret = drm_sched_job_add_dependency(job, fence);
> > > - if (ret) {
> > > - dma_fence_put(fence);
> > > - return ret;
> > > - }
> > > + ret = drm_sched_job_add_dependency(job, dma_fence_get(fence));
> >
> > You still take a reference as before, but there is no comment anymore.
> > Can you add one explaining why a new reference is taken here?
> >
> > I guess it will be something like "This needs a new reference for the
> > job", since you cannot rely on the one from resv.
>
> I was mulling it over but then thought the kerneldoc for
> drm_sched_job_add_dependency() is good enough since it explains the function
> always consumes the reference and dma_resv_for_each_fence() does not hold
> one over the iteration. I can add it.
>
> >
> > > + if (ret)
> > > + break;
> > > }
> > > - return 0;
> > > + return ret;
> >
> >
> > That's an unnecessarily enlargement of the git diff because of style,
> > isn't it? Better keep the diff minimal here for git blame.
>
> Given the relative size of the diff versus the function itself it did not
> look like a big deal to reduce to one return statement while touching it,
> but I can keep two returns just as well, no problem.
>
> Regards,
>
> Tvrtko
>