On Tue, 2026-06-09 at 12:26 +0200, Christian König wrote:
> On 6/9/26 07:52, Philipp Stanner wrote:
> ...
> > > 
> > > In detail calling the callbacks without holding locks allows all 
> > > implementations who need it to explicitly take locks in the order they 
> > > want.
> > 
> > Didn't you say a few mails above that the implementation should not use
> > the fence lock for its own purposes?
> 
> The usual use case the drivers have is this here:
> 
> dma_fence_lock_irqsave(fence, flags);
> was_signaled = dma_fence_test_signaled(fence);
> if (!was_signaled)
>       dma_fence_signal(fence);
> dma_fence_unlock_irqrestore(fence, flags);
> 
> if (!was_signaled)

If cleanup() touches fence data, this is now exactly the race that I am
concerned with.

With the current design, you'd actually need a synchronize_rcu() here,
which you definitely do not want in a hot path :(

>       cleanup();
> 
> This is actually what you and me came up with for the KFD when we
> removed the return code for dma_fence_signal().

Well, what we came up with for that rare case was

was_signaled = dma_fence_check_and_signal(fence);
if (!was_signaled)
        cleanup();

It seems that many of the other use cases where the fence lock is taken
by drivers ultimately are rooted in the (IMO) design mistake that
dma_fence_set_error() exists.

There should just be

void
dma_fence_signal(struct dma_fence *f, int err);

which also conveniently forces all signalers to really carefully think
about whether the fence's associated operations succeeded. Correctly
representing the "true fence error state" to all consumers is vital for
sane system behavior, as you continuously have to point out.

> 
> Taking the lock around the enable_signaling() callback has the exact
> same reason, preventing the fence from signaling between testing and
> calling dma_fence_signal(). The problem with that approach is that
> cleanup() now suddenly runs under the fence lock as well.

That problem does not exist if we re-design dma_fence like that:

// driver
dma_fence_signal(f); // revokes all accesses to our driver through backend_ops
// synchronize_rcu() now unnecessary \o/
cleanup(f); // We know that all accessors are gone
dma_fence_put(f);

> 
> So you are left with few options: Either the fence lock is external,
> which we don't want because that make the fence non-independent, or
> cleanup() defers work to irq_work or work_structs, which creates
> numerous lifetime issues.

Yup, this is uncool and we want to avoid that.

But these seem to be the options

1. Ensure proper synchronization
2. Wait for a grace period in a hot path
3. Defer cleanup() with some delay mechanism

#1 is by far the cleanest approach. I still cannot see any downside,
and quite a few upsides.

https://elixir.bootlin.com/linux/v7.1-rc6/source/drivers/dma-buf/dma-fence.c#L1025

^ is already racing with the signaled check.


I'm going through the users right now, and it seems all we need to
ensure to implement this change is to audit all ops->signaled() for
usage of the fence_lock and see if we can port them.

> 
> When enable_signaling would be independent of the fence lock we could
> avoid all of this and just use a normal spin_lock() for the cleanup.

How's that related to our problem?

P.

Reply via email to