Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Philipp Stanner Thu, 11 Jun 2026 02:50:58 -0700

On Thu, 2026-06-11 at 11:14 +0200, Christian König wrote:
> On 6/11/26 10:35, Philipp Stanner wrote:
> > On Wed, 2026-06-10 at 17:15 +0200, Christian König wrote:
> > > On 6/10/26 16:25, Philipp Stanner wrote:
> > > > 
> > > > Are you referring to this comment from the documentation?
> > > > 
> > > >  * Since many implementations can call dma_fence_signal() even
> > > > when before
> > > >  * @enable_signaling has been called there's a race window,
> > > > where the
> > > >  * dma_fence_signal() might result in the final fence reference
> > > > being
> > > >  * released and its memory freed. To avoid this,
> > > > implementations of this
> > > >  * callback should grab their own reference using
> > > > dma_fence_get(), to be
> > > >  * released when the fence is signalled (through e.g. the
> > > > interrupt
> > > >  * handler).
> > > >  *
> > > >  * This callback is optional. If this callback is not present,
> > > > then the
> > > >  * driver must always have signaling enabled.
> > > >  */
> > > >  bool (*enable_signaling)(struct dma_fence *fence);
> > > 
> > > Yes, that was an extremely bad idea which I have tried multiple
> > > times to fix.
> > 
> > What's the reason why enable_signaling() is nowadays called with
> > lock-
> > protection?
> 
> That was some decision originally made a long long time ago because
> the initial thought was that fence need to signal in order, but that
> concept was also abandoned a long long time ago.


You also need it to avoid a race with the driver's requested signalling
in ops->enable_signalling(), through dma_fence_add_callback().

I pointed that out in my answer. You cut it out and did not react to
that statement.

> 
> Fixing this is on my TODO list ever since we figured out that this
> was a bad idea, see the latest patch set here as well:
> https://www.spinics.net/lists/dri-devel/msg461253.html
> 
> > > > Sure you can. If everything is properly synchronized.
> > > > 
> > > > // driver
> > > > dma_fence_signal(f);
> > > > // all callbacks can't reach our driver anymore
> > > 
> > > That's irrelevant. The question is not if a callback can reach
> > > the backend after signaling.
> > 
> > Irrelevant for your lock-inversion maybe. It's very relevant for
> > life
> > time and module unload. Proof: you and Tvrtko made
> > dma_fence_signal()
> > the decoupling point, with RCU protection.
> > 
> > And as I keep saying, that synchronisation point would make the
> > entire
> > framework simpler and more robust if we were making it consistent.
> 
> Exactly that's what I strongly disagree on.

Yeah, and you don't provide any rationale AFAICS. The argument
basically is "this is how it is, and it is how it is because it is like
that". It's "this is done with RCU instead of locks, because you can
use RCU for that".

No offense intended, but I just can't see any substantial argument
except for non-existing deadlocks that COULD ALREADY OCCUR and have to
be mitigated by the drivers already.

> 
> The fence lock and the synchronization point which allows the driver
> to know when nobody else is in a callback any more are two very
> different things we should not be mixed up together.

Why? It's literally what locks exist for. For synchronization.

RCU has been literally developed for deadlock avoidance, which its
usage in dma_fence provably doesn't achieve. You still have the
deadlock problem, and on top of that you have redundant, unnecessary
RCU calls.

> 
> The fence lock protects the signaled state of the fence, so that the
> fence functions can add callbacks, test signaling etc... while the
> fence can't signal. It is protecting internal state of the fence and
> so should be internal to the fence, external code should not touch it
> if possible.

You allow people to take the lock manually with the API function
dma_fence_lock_irqsave().

But you know what, I absolutely agree!

I think if we take the locks for the callbacks and remove
dma_fence_set_error(), then the need to ever touch the lock manually
disappears.

Then the driver knows what I pointed out at least three times now:

dma_fence_signal(f);
cleanup(f); // perfectly safe, bc after signal() all accesors are done

No RCU, no nothing. Just plain proper synchronization.

But when I point that out, you say "No no, you can't do that, that's
not the purpose of the lock, you have to do:

dma_fence_signal(f);
call_rcu(cleanup(f));
"

???

> 
> The synchronization point for the driver is a service the dma_fence
> implementation offers to let drivers know when there is no more
> caller of their function. And usually this is implemented using
> RCU/SRCU.

"this is how it is because it is how it is".

The one and only reason why you need RCU is that you don't lock
consistently.

Using RCU everywhere is nothing but symptom treatment.

> 
> Tvrko and I now choose RCU instead of SRCU because there didn't seem
> to be a need to sleep in the fence callbacks.

You can't sleep in the callbacks, because the callbacks are invoked
with the fence spinlock being held.

> 
> But both RCU and SRCU offer not only synchronize_rcu()/_srcu() but
> also call_rcu()/_srcu() which allows drivers to delegate the cleanup
> to a point where it is save to do so.

"We use RCU instead of locks because we can use RCU instead of locks".
This is not an argument against locks, Christian.

If you lock consistently, cleaning up after dma_fence_signal() is
guaranteed to be safe to do, immediately.

> 
> So while I see the problem I actually don't understand why you insist
> of solving it with the fence lock?

Well, that's then a philosophical moment, because I on the contrary
don't understand why you insist on solving it with RCU :D

The tl;dr is: using the locks makes the entire implementation more
robust and consistent and allows for removing the RCU delays.

I showed you a bug that would have been prevented with locks. It was
caused by the improper synchronization, where dma_fence_is_signaled()
can return true when the callbacks have not yet been run.

The solution is to use

dma_fence_lock_irqsave(f);
if (dma_fence_is_signaled_locked(f))
  cleanup(f);
dma_fence_unlock_irqrestore(f);

IOW… the solution is to call the ops->signaled() callback with the lock
held……… notice something? ;)

I provided more rationale on the rest of my mail, but you cut that out.
Here, let me repost it:


"
You need the lock for synchronization. To add the callback.

Basically, you came quite naturally to the same conclusion and solution
as I did: you need / want race-free synchronization.

It's the exact same use-case, really. You want to prevent someone from
interfering with the fence while the ops->callback is running. You need
the lock for that, by definition.

The other two relevant callbacks, signaled() and set_deadline(),
already have the same lock protection guarantee. The former through
dma_fence_is_signaled_locked(), the latter through  the fact that
everyone takes the lock immediately after callback begin anyways.
"

and

"
All we have to do is make the existing rules official and the code a
bit more consistent.
"


The deadlock argument doesn't hold up because the lock is already being
taken for enable_signalling() and signaled().

I showed you one Nouveau bug which the locks would have prevented.

Furthermore, I argue that it's impossible for you to remove the lock
protection from ops->enable_signalling().

I further argue that consistent lock protection would allow for
removing the need for dma_fence_is_signaled_locked() and
dma_fence_lock_irqsave() as public APIs.


>  That just brings us back to the bad design we had before where
> internal fence state is abused to protect something else.

If we apply my design, we could deprecate dma_fence_lock_irqsave() and
thereby declare it illegal to do that.

Relying on dma_fence_signal() being a point of synchronization is
completely different from using its internal lock to protect your own
list or tree or whatever.


P.

> 
> Regards,
> Christian.

Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Reply via email to