Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Philipp Stanner Thu, 11 Jun 2026 01:35:51 -0700

On Wed, 2026-06-10 at 17:15 +0200, Christian König wrote:
> On 6/10/26 16:25, Philipp Stanner wrote:
> > 
> > Are you referring to this comment from the documentation?
> > 
> >  * Since many implementations can call dma_fence_signal() even when before
> >  * @enable_signaling has been called there's a race window, where the
> >  * dma_fence_signal() might result in the final fence reference being
> >  * released and its memory freed. To avoid this, implementations of this
> >  * callback should grab their own reference using dma_fence_get(), to be
> >  * released when the fence is signalled (through e.g. the interrupt
> >  * handler).
> >  *
> >  * This callback is optional. If this callback is not present, then the
> >  * driver must always have signaling enabled.
> >  */
> >  bool (*enable_signaling)(struct dma_fence *fence);
> 
> Yes, that was an extremely bad idea which I have tried multiple times to fix.


What's the reason why enable_signaling() is nowadays called with lock-
protection?

> 
> > 
> > > and delegates the signaling to the caller instead of doing it itself.
> > 
> > Who delegates what to whom?
> > 
> > enable_signaling() is called indirectly by someone who adds a callback.
> > If enable_signaling() is implemented, then the driver callback's only
> > job is to activate some sort of interrupt or worker that will signal
> > that fence at some point.
> 
> No, enable_signal also returns if enabling was successfully if it wasn't 
> successfully the dma_fence framework signals the fence.  
> 
> > > This in turn means that you can't do proper cleanup after the
> > > signaling is done by grabbing driver specific locks.
> > 
> > Sure you can. If everything is properly synchronized.
> > 
> > // driver
> > dma_fence_signal(f);
> > // all callbacks can't reach our driver anymore
> 
> That's irrelevant. The question is not if a callback can reach the backend 
> after signaling.

Irrelevant for your lock-inversion maybe. It's very relevant for life
time and module unload. Proof: you and Tvrtko made dma_fence_signal()
the decoupling point, with RCU protection.

And as I keep saying, that synchronisation point would make the entire
framework simpler and more robust if we were making it consistent.

> 
> The question is if the backend can clean up after dma_fence_signal() 
> completes.
> 
> > struct driver_fence = container_of(f);
> > lock(driver_fence->special_lock);
> > cleanup(f);
> > unlock(…);
> > 
> > Where is the deadlock?
> > 
> > I continue to fail to see it. Show me an example of some code that
> > would deadlock, please. Either fictive or from the kernel.
> 
> 
> Interrupt driven signaling path:
> 
> spin_lock_irqsave(driver->fence_list_lock, flags);
> list_for_each_entry_safe(...) {
>       if (fence->seqno < signaled_seqno)
>               break;
>       dma_fence_signal(fence);
>       list_entry_del(&fence->list);
>       dma_fence_put(fence);
> }
> spin_unlock_irqsave(driver->list_lock, flags);
> 
> Enable signaling path:
> 
> myfence_enable_signaling()
> {
>       if (fence->seqno <= signaled_seqno)
>               return false;
> 
>       talk_to_the_hw();
>       return true;
> }
> 
> The problem here is that the enable_signaling path can't grab the
> driver->fence_list_lock because that would be lock inversion with the
> fence lock.
> 
> Implementations came up with tons of workarounds for this which only
> work more or less correctly. The issues Nouveau had is just the tip
> of the iceberg here.

The fact that every implementor right now can live without driver locks
proofs that it can be done.

The few implementors of enable_signaling() who'd really run into an
inversion could address that with various techniques, like using an RCU
list, llist, rw_lock..

I would imagine the most common action for enable_signaling() is to
check whether an interrupt is already registered. If not, register it.
Is that where you got lock inversions?

> 
> If we nuke the fact that enable_signaling() is called while holding
> the fence lock all that complexity goes away.

I'm afraid you cannot nuke that, Dr. Oppenheimer :p

You need the lock for synchronization. To add the callback.

Basically, you came quite naturally to the same conclusion and solution
as I did: you need / want race-free synchronization.

It's the exact same use-case, really. You want to prevent someone from
interfering with the fence while the ops->callback is running. You need
the lock for that, by definition.

The other two relevant callbacks, signaled() and set_deadline(),
already have the same lock protection guarantee. The former through
dma_fence_is_signaled_locked(), the latter through  the fact that
everyone takes the lock immediately after callback begin anyways.

>  In other words it would then look like this:
> 
> myfence_enable_signaling()
> {
>       spin_lock_irqsave(driver->fence_list_lock, flags);      
>       if (fence->seqno > signaled_seqno) {
>               talk_to_the_hw();

This doesn't need the list lock.

>       } else {
>               dma_fence_signal(fence);

Without the fence lock protection someone else could have signaled the
fence already btw and now be spinning to do the same list deletion.

>               list_del(fence->list);
>               dma_fence_put(fence);
>       }
>       spin_unlock_irqrestore(driver->fence_list_lock, flags); 
> }


Plus I don't see why that callback would have to signal the fence. The
return of the boolean¹ seems very consistent with
dma_fence_is_signaled() to me, which asks "is this job done?" and if
the answer is "yes", the framework signals that fence.

Regarding ops->signaled(), that really just needs to read some
ioremaped u64 integer and then return true or false. That will never
need a lock for anything.

So again: the code is already such that implementors of
enable_signaling() and signaled() need to assume that the fence lock is
being held by someone else. All the practical evidence shows that the
lock protection is actually necessary.

All we have to do is make the existing rules official and the code a
bit more consistent.

P.


[¹] though we probably want an int error code as the return value of
enable_signaling(), to distinguish between error and "already
signaled".

> 
> > Nouveau's enable_signaling() callback does not take locks. Its
> > signaled() does not take locks.
> 
> Yeah because Nouveau reverted like most driver to use the same
> spinlock for the driver lock and the fence lock.
> 
> But that approach is fundamentally broken, a) you can't cleanup from
> the is_signaled path because that can be called with both the lock
> held and not held and b) it doesn't allow the fences to be
> independent of the driver who issued them.
> 
> When it would just be that we consistently call the is_signaled path
> with the lock held I would immediately agree, but that breaks fence
> independence and already caused so many issues that I clearly want to
> remove it.
> 
> Regards,
> Christian.

Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Reply via email to