On 6/9/26 12:42, Philipp Stanner wrote:
> On Tue, 2026-06-09 at 12:26 +0200, Christian König wrote:
>> On 6/9/26 07:52, Philipp Stanner wrote:
>> ...
>>>>
>>>> In detail calling the callbacks without holding locks allows all 
>>>> implementations who need it to explicitly take locks in the order they 
>>>> want.
>>>
>>> Didn't you say a few mails above that the implementation should not use
>>> the fence lock for its own purposes?
>>
>> The usual use case the drivers have is this here:
>>
>> dma_fence_lock_irqsave(fence, flags);
>> was_signaled = dma_fence_test_signaled(fence);
>> if (!was_signaled)
>>      dma_fence_signal(fence);
>> dma_fence_unlock_irqrestore(fence, flags);
>>
>> if (!was_signaled)
> 
> If cleanup() touches fence data, this is now exactly the race that I am
> concerned with.
> 
> With the current design, you'd actually need a synchronize_rcu() here,
> which you definitely do not want in a hot path :(
> 
>>      cleanup();
>>
>> This is actually what you and me came up with for the KFD when we
>> removed the return code for dma_fence_signal().
> 
> Well, what we came up with for that rare case was
> 
> was_signaled = dma_fence_check_and_signal(fence);
> if (!was_signaled)
>       cleanup();
> 
> It seems that many of the other use cases where the fence lock is taken
> by drivers ultimately are rooted in the (IMO) design mistake that
> dma_fence_set_error() exists.
> 
> There should just be
> 
> void
> dma_fence_signal(struct dma_fence *f, int err);
> 
> which also conveniently forces all signalers to really carefully think
> about whether the fence's associated operations succeeded. Correctly
> representing the "true fence error state" to all consumers is vital for
> sane system behavior, as you continuously have to point out.

Yeah that came to my mind before as well.

>>
>> Taking the lock around the enable_signaling() callback has the exact
>> same reason, preventing the fence from signaling between testing and
>> calling dma_fence_signal(). The problem with that approach is that
>> cleanup() now suddenly runs under the fence lock as well.
> 
> That problem does not exist if we re-design dma_fence like that:
> 
> // driver
> dma_fence_signal(f); // revokes all accesses to our driver through backend_ops
> // synchronize_rcu() now unnecessary \o/
> cleanup(f); // We know that all accessors are gone
> dma_fence_put(f);

Yeah and exactly that doesn't work.

Just think about the Nouveau case when you have your fences on a double linked 
list.

When the fence lock is independent, e.g. have a separate lock for each fence 
then this lock can't protect this double linked list.

So your cleanup path needs to take a lock which protects the list, but you then 
run into lock inversion.

>>
>> So you are left with few options: Either the fence lock is external,
>> which we don't want because that make the fence non-independent, or
>> cleanup() defers work to irq_work or work_structs, which creates
>> numerous lifetime issues.
> 
> Yup, this is uncool and we want to avoid that.
> 
> But these seem to be the options
> 
> 1. Ensure proper synchronization
> 2. Wait for a grace period in a hot path
> 3. Defer cleanup() with some delay mechanism
> 
> #1 is by far the cleanest approach. I still cannot see any downside,
> and quite a few upsides.
> 
> https://elixir.bootlin.com/linux/v7.1-rc6/source/drivers/dma-buf/dma-fence.c#L1025
> 
> ^ is already racing with the signaled check.

Yeah so what? That is just an opportunistic check. 

> I'm going through the users right now, and it seems all we need to
> ensure to implement this change is to audit all ops->signaled() for
> usage of the fence_lock and see if we can port them.
> 
>>
>> When enable_signaling would be independent of the fence lock we could
>> avoid all of this and just use a normal spin_lock() for the cleanup.
> 
> How's that related to our problem?

See the explanation above for the Nouveau example.

As far as I can see the approach you want to implement here is a NO-GO from my 
side.

Regards,
Christian.

> 
> P.

Reply via email to