Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Christian König Thu, 11 Jun 2026 04:07:17 -0700

On 6/11/26 11:50, Philipp Stanner wrote:
> On Thu, 2026-06-11 at 11:14 +0200, Christian König wrote:
>> On 6/11/26 10:35, Philipp Stanner wrote:
>>> On Wed, 2026-06-10 at 17:15 +0200, Christian König wrote:
>>>> On 6/10/26 16:25, Philipp Stanner wrote:
>>>>>
>>>>> Are you referring to this comment from the documentation?
>>>>>
>>>>>  * Since many implementations can call dma_fence_signal() even
>>>>> when before
>>>>>  * @enable_signaling has been called there's a race window,
>>>>> where the
>>>>>  * dma_fence_signal() might result in the final fence reference
>>>>> being
>>>>>  * released and its memory freed. To avoid this,
>>>>> implementations of this
>>>>>  * callback should grab their own reference using
>>>>> dma_fence_get(), to be
>>>>>  * released when the fence is signalled (through e.g. the
>>>>> interrupt
>>>>>  * handler).
>>>>>  *
>>>>>  * This callback is optional. If this callback is not present,
>>>>> then the
>>>>>  * driver must always have signaling enabled.
>>>>>  */
>>>>>  bool (*enable_signaling)(struct dma_fence *fence);
>>>>
>>>> Yes, that was an extremely bad idea which I have tried multiple
>>>> times to fix.
>>>
>>> What's the reason why enable_signaling() is nowadays called with
>>> lock-
>>> protection?
>>
>> That was some decision originally made a long long time ago because
>> the initial thought was that fence need to signal in order, but that
>> concept was also abandoned a long long time ago.
> 
> You also need it to avoid a race with the driver's requested signalling
> in ops->enable_signalling(), through dma_fence_add_callback().
> 
> I pointed that out in my answer. You cut it out and did not react to
> that statement.


Yeah because I don't think it's relevant.

You don't have a race between enabling signaling and adding the callback, that 
are two independent functions.

Enable signaling is something which must be done before adding the callbacks, 
that's granted but they don't need a common lock protection.

>> Fixing this is on my TODO list ever since we figured out that this
>> was a bad idea, see the latest patch set here as well:
>> https://www.spinics.net/lists/dri-devel/msg461253.html
>>
>>>>> Sure you can. If everything is properly synchronized.
>>>>>
>>>>> // driver
>>>>> dma_fence_signal(f);
>>>>> // all callbacks can't reach our driver anymore
>>>>
>>>> That's irrelevant. The question is not if a callback can reach
>>>> the backend after signaling.
>>>
>>> Irrelevant for your lock-inversion maybe. It's very relevant for
>>> life
>>> time and module unload. Proof: you and Tvrtko made
>>> dma_fence_signal()
>>> the decoupling point, with RCU protection.
>>>
>>> And as I keep saying, that synchronisation point would make the
>>> entire
>>> framework simpler and more robust if we were making it consistent.
>>
>> Exactly that's what I strongly disagree on.
> 
> Yeah, and you don't provide any rationale AFAICS. The argument
> basically is "this is how it is, and it is how it is because it is like
> that". It's "this is done with RCU instead of locks, because you can
> use RCU for that".
> 
> No offense intended, but I just can't see any substantial argument
> except for non-existing deadlocks that COULD ALREADY OCCUR and have to
> be mitigated by the drivers already.

Well those deadlocks absolutely do exists. I mean just take look at how bad the 
drivers have to work around the issue that enable_signaling and is_signaled can 
be called with the lock held.

>> The fence lock and the synchronization point which allows the driver
>> to know when nobody else is in a callback any more are two very
>> different things we should not be mixed up together.
> 
> Why? It's literally what locks exist for. For synchronization.

No, locks exists for protection of state.

> RCU has been literally developed for deadlock avoidance, which its
> usage in dma_fence provably doesn't achieve. You still have the
> deadlock problem, and on top of that you have redundant, unnecessary
> RCU calls.

Ok, where exactly do you see the problem? 

>> The fence lock protects the signaled state of the fence, so that the
>> fence functions can add callbacks, test signaling etc... while the
>> fence can't signal. It is protecting internal state of the fence and
>> so should be internal to the fence, external code should not touch it
>> if possible.
> 
> You allow people to take the lock manually with the API function
> dma_fence_lock_irqsave().
> 
> But you know what, I absolutely agree!
> 
> I think if we take the locks for the callbacks and remove
> dma_fence_set_error(), then the need to ever touch the lock manually
> disappears.

I agree on that but that makes it impossible for the dma_fence to independent 
of the driver who issues it.

> 
> Then the driver knows what I pointed out at least three times now:
> 
> dma_fence_signal(f);
> cleanup(f); // perfectly safe, bc after signal() all accesors are done
> 
> No RCU, no nothing. Just plain proper synchronization.

No, that doesn't work because you have multiple paths to dma_fence_signal():

A) Driver is calling that from IRQ.
B) Through calling dma_fence_enable_sw_signaling() and the underlying callback 
returning false.
C) Through calling dma_fence_is_signaled() and the underlying callback 
returning true.

For A you clearly can use that approach, but it mangles every attempt to 
cleanup for B & C because the driver can't grab locks which it would grab on 
path A.

It's really a picture book classic lock inversion problem with callbacks.

> But when I point that out, you say "No no, you can't do that, that's
> not the purpose of the lock, you have to do:
> 
> dma_fence_signal(f);
> call_rcu(cleanup(f));
> "
> 
> ???

Yes, what exactly is wrong with that? I mean you must have some kind of 
technical issue with that which I don't understand.

> 
>>
>> The synchronization point for the driver is a service the dma_fence
>> implementation offers to let drivers know when there is no more
>> caller of their function. And usually this is implemented using
>> RCU/SRCU.
> 
> "this is how it is because it is how it is".
> 
> The one and only reason why you need RCU is that you don't lock
> consistently.
> 
> Using RCU everywhere is nothing but symptom treatment.
> 
>>
>> Tvrko and I now choose RCU instead of SRCU because there didn't seem
>> to be a need to sleep in the fence callbacks.
> 
> You can't sleep in the callbacks, because the callbacks are invoked
> with the fence spinlock being held.

Yes which is the case at the moment, but that is something we want to get away 
from.

>>
>> But both RCU and SRCU offer not only synchronize_rcu()/_srcu() but
>> also call_rcu()/_srcu() which allows drivers to delegate the cleanup
>> to a point where it is save to do so.
> 
> "We use RCU instead of locks because we can use RCU instead of locks".
> This is not an argument against locks, Christian.
> 
> If you lock consistently, cleaning up after dma_fence_signal() is
> guaranteed to be safe to do, immediately.

Yes I agree on that, but you completely mess up all alternative signaling paths 
and that is something I can't accept.

>>
>> So while I see the problem I actually don't understand why you insist
>> of solving it with the fence lock?
> 
> Well, that's then a philosophical moment, because I on the contrary
> don't understand why you insist on solving it with RCU :D

Well because of the technical problems I see :D

> The tl;dr is: using the locks makes the entire implementation more
> robust and consistent and allows for removing the RCU delays.
> 
> I showed you a bug that would have been prevented with locks. It was
> caused by the improper synchronization, where dma_fence_is_signaled()
> can return true when the callbacks have not yet been run.

No, that problem wouldn't be fixed by that.

Even when we call the callbacks with the locks held has no effect on the 
handling in dma_fence_is_signaled().

We would also need to move the testing if a fence is signaled or not under the 
lock and I can guarantee you that there will be push back on this.

> The solution is to use
> 
> dma_fence_lock_irqsave(f);
> if (dma_fence_is_signaled_locked(f))
>   cleanup(f);
> dma_fence_unlock_irqrestore(f);
> 
> IOW… the solution is to call the ops->signaled() callback with the lock
> held……… notice something? ;)
> 
> I provided more rationale on the rest of my mail, but you cut that out.
> Here, let me repost it:
> 
> 
> "
> You need the lock for synchronization. To add the callback.
> 
> Basically, you came quite naturally to the same conclusion and solution
> as I did: you need / want race-free synchronization.
> 
> It's the exact same use-case, really. You want to prevent someone from
> interfering with the fence while the ops->callback is running. You need
> the lock for that, by definition.
> 
> The other two relevant callbacks, signaled() and set_deadline(),
> already have the same lock protection guarantee. The former through
> dma_fence_is_signaled_locked(), the latter through  the fact that
> everyone takes the lock immediately after callback begin anyways.
> "
> 
> and
> 
> "
> All we have to do is make the existing rules official and the code a
> bit more consistent.
> "
> 
> 
> The deadlock argument doesn't hold up because the lock is already being
> taken for enable_signalling() and signaled().
> 
> I showed you one Nouveau bug which the locks would have prevented.

Well I can clearly disproved that.

> Furthermore, I argue that it's impossible for you to remove the lock
> protection from ops->enable_signalling().

Yeah that is something I don't understand. Where exactly do you see the problem 
here?

We already have patches which does that which have been reviewed and tested and 
there wasn't any problem regarding that. It can be that we missed this because 
it was never pushed upstream.

> I further argue that consistent lock protection would allow for
> removing the need for dma_fence_is_signaled_locked() and
> dma_fence_lock_irqsave() as public APIs.

I agree that this goal would be nice to have, but I don't see how your proposal 
will help with that.

Regards,
Christian.

> 
> 
>>  That just brings us back to the bad design we had before where
>> internal fence state is abused to protect something else.
> 
> If we apply my design, we could deprecate dma_fence_lock_irqsave() and
> thereby declare it illegal to do that.
> 
> Relying on dma_fence_signal() being a point of synchronization is
> completely different from using its internal lock to protect your own
> list or tree or whatever.



> 
> 
> P.
> 
>>
>> Regards,
>> Christian.

Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

Reply via email to