On Thu, Oct 30, 2025 at 11:59:01AM +0100, Christian König wrote:
> On 10/29/25 21:53, Matthew Brost wrote:
> > On Tue, Oct 28, 2025 at 03:06:22PM +0100, Christian König wrote:
> >> On 10/17/25 10:32, Philipp Stanner wrote:
> >>> On Tue, 2025-10-14 at 17:54 +0200, Christian König wrote:
> >>>> On 13.10.25 16:54, Philipp Stanner wrote:
> >>>>> On Mon, 2025-10-13 at 15:48 +0200, Christian König wrote:
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> dma_fences have always lived under the tyranny dictated by the module
> >>>>>> lifetime of their issuer, leading to crashes should anybody still hold
> >>>>>> a reference to a dma_fence when the module of the issuer is unloaded.
> >>>>>>
> >>>>>> But those days are over! The patch set following this mail finally
> >>>>>> implements a way for issuers to release their dma_fences from this
> >>>>>> slavery and let them outlive the module that originally created them.
> >>>>>>
> >>>>>> Previously, various approaches have been discussed, including changing
> >>>>>> the locking semantics of the dma_fence callbacks (by me) as well as
> >>>>>> using the drm scheduler as an intermediate layer (by Sima) to
> >>>>>> disconnect dma_fences from their actual users.
> >>>>>>
> >>>>>> Changing the locking semantics turned out to be much trickier than
> >>>>>> originally thought, because especially on older drivers (nouveau,
> >>>>>> radeon, but also i915) these locking semantics are actually needed for
> >>>>>> correct operation.
> >>>>>>
> >>>>>> Using the drm_scheduler as an intermediate layer is still a good idea
> >>>>>> and should probably be implemented to make life simpler for some
> >>>>>> drivers, but it doesn't work for all use cases. Especially TLB flush
> >>>>>> fences, preemption fences and userqueue fences don't go through the
> >>>>>> drm scheduler because it doesn't make sense for them.
> >>>>>>
> >>>>>> Tvrtko did some really nice prerequisite work by protecting the
> >>>>>> returned strings of the dma_fence_ops by RCU. This way dma_fence
> >>>>>> creators were able to just wait for an RCU grace period after fence
> >>>>>> signaling before they could safely free those data structures.
> >>>>>>
> >>>>>> Now this patch set goes a step further and protects the whole
> >>>>>> dma_fence_ops structure by RCU, so that after the fence signals the
> >>>>>> pointer to the dma_fence_ops is set to NULL when neither a wait nor a
> >>>>>> release callback is given. All functionality which uses the
> >>>>>> dma_fence_ops reference is put inside an RCU critical section, except
> >>>>>> for the deprecated issuer-specific wait and of course the optional
> >>>>>> release callback.
> >>>>>>
> >>>>>> In addition to the RCU changes, the lock protecting the dma_fence
> >>>>>> state previously had to be allocated externally. This set now changes
> >>>>>> the functionality to make that external lock optional and allows
> >>>>>> dma_fences to use an inline lock and be self-contained.
> >>>>>
> >>>>> Allowing for an embedded lock, is that actually necessary for the goals
> >>>>> of this series, or is it an optional change / improvement?
> >>>>
> >>>> It is kind of necessary because otherwise you can't fully determine the
> >>>> lifetime of the lock.
> >>>>
> >>>> The lock is, for example, used to prevent a dma_fence from signaling
> >>>> while you modify its linked list of callbacks.
> >>>>
> >>>> An alternative would be to protect the lock by RCU as well instead of
> >>>> embedding it in the structure, but that would make things even more
> >>>> complicated.
> >>>>
> >>>>> If I understood you correctly at XDC, you wanted to have an embedded
> >>>>> lock because it improves the memory footprint and because an external
> >>>>> lock couldn't achieve some of the originally intended goals regarding
> >>>>> fence-signaling order. Can you elaborate on that?
> >>>>
> >>>> The embedded lock is also nice to have for the dma_fence_array,
> >>>> dma_fence_chain and drm_sched_fence, but that just saves a few cache
> >>>> lines in some use cases.
> >>>>
> >>>> The fence-signaling order is important for drivers like radeon where
> >>>> the external lock protects multiple fences from signaling at the same
> >>>> time and makes sure that everything stays in order.
> >
> > Not to derail the conversation, but I noticed that dma-fence-arrays can,
> > in fact, signal out of order. The issue lies in the dma-fence-array
> > callback, which signals the fence using irq_work_queue(). Internally,
> > irq_work uses an llist, a LIFO structure. So, if two dma-fence-arrays
> > have all their fences signaled from a thread, the IRQ work that signals
> > each individual dma-fence-array will execute out of order.
> >
> > We should probably fix this.
>
> No, we shouldn't. That's what I've been trying to point out all along.
>
> The original idea of sharing the lock was to guarantee that fences signal
> in order, but that never worked correctly even for driver fences.
>
> The background is the optimization we do in the signaling fast path, e.g.
> when dma_fence_is_signaled() is called.
>

Ah, yes, I see this now. I was operating under the assumption that fences
on a timeline must signal in order, but that's not actually true. What is
true is that if a fence later on a timeline signals, all prior fences are
complete (i.e., the underlying hardware condition is met, even if the
software hasn't signaled them yet).

Could we document this somewhere in the dma-fence kernel docs? I can take a
stab at writing it up if you'd like. This is a fairly confusing aspect of
dma-fence behavior. (To check my understanding, I've put a rough sketch of
such a fast path, plus my reading of the new RCU scheme, at the bottom of
this mail.)

Matt

> This means that when fences A, B and C are submitted to the HW it is
> perfectly possible that somebody queries the status of fence B but not A
> and C. And this querying of the status is faster than the interrupt which
> signals A and C.
>
> So in this scenario B signals before A.
>
> The only way to avoid that is to not implement the fast path, and as far
> as I know no real HW driver does that because it makes your driver
> horribly slow.
>
> So off to the trash bin with the signaling order; things have worked for
> over 10 years without it and as far as I know nobody complained about it.
>
> Regards,
> Christian.
>
> >
> > Matt
> >
> >>>
> >>> I mean, neither external nor internal lock can somehow force the driver
> >>> to signal fences in order, can they?
> >>
> >> Nope, as I said before this approach is actually pretty useless.
> >>
> >>> Only the driver can ensure this.
> >>
> >> Only if the signaled callback is not implemented, and basically all
> >> drivers do implement it.
> >>
> >> So the whole point of sharing the lock just doesn't exist any more; it's
> >> just that changing it all at once, as I tried before, results in a way
> >> too big patch.
> >>>
> >>> I am, however, considering modeling something like that on a
> >>> FenceContext object:
> >>>
> >>> fctx.signal_all_fences_up_to_ordered(seqno);
> >>
> >> Yeah, I have patches for that as well. But then I found that amdgpu's
> >> TLB fences trigger that check and I won't have time to fix it.
> >>
> >>>
> >>> P.
> >>>
> >>>> While it is possible to change the locking semantics on such old
> >>>> drivers, it's probably just better to stay away from it.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>>
> >>>>> P.
> >>>>>
> >>>>>> The new approach is then applied to amdgpu, allowing the module to be
> >>>>>> unloaded even when dma_fences issued by it are still around.
> >>>>>>
> >>>>>> Please review and comment,
> >>>>>> Christian.
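
P.S.: Here is the fast-path sketch I mentioned above, just to check that I
understand the scenario. Everything below is made up (my_fence, my_ring and
friends are not taken from any real driver); it only shows the usual pattern
of a ->signaled() callback that polls a seqno the hardware writes back, so
whichever fence happens to be queried first is the first one marked signaled
in software, regardless of submission order:

#include <linux/dma-fence.h>
#include <linux/io.h>
#include <linux/kernel.h>

struct my_ring {
        u32 __iomem *hw_seqno;  /* last seqno the HW has written back */
};

struct my_fence {
        struct dma_fence base;
        struct my_ring *ring;
};

static const char *my_fence_get_driver_name(struct dma_fence *f)
{
        return "my_driver";
}

static const char *my_fence_get_timeline_name(struct dma_fence *f)
{
        return "my_ring";
}

/*
 * Fast path: poll the HW write-back instead of waiting for the completion
 * interrupt. If fence B is queried here before the interrupt for A and C
 * has fired, B is the first fence to be marked signaled in software, even
 * though A completed on the hardware earlier.
 */
static bool my_fence_signaled(struct dma_fence *f)
{
        struct my_fence *fence = container_of(f, struct my_fence, base);
        u32 hw_seqno = readl(fence->ring->hw_seqno);

        return (s32)(hw_seqno - lower_32_bits(f->seqno)) >= 0;
}

static const struct dma_fence_ops my_fence_ops = {
        .get_driver_name = my_fence_get_driver_name,
        .get_timeline_name = my_fence_get_timeline_name,
        .signaled = my_fence_signaled,
};

If that matches what the drivers effectively do, then dma_fence_is_signaled(B)
can flip B to signaled while A is still waiting for its completion interrupt,
which is exactly the scenario you describe above.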

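P.P.S.: And here is my, possibly wrong, reading of the RCU scheme from the
cover letter, again with made-up names. The big assumption (please correct me
if I got it backwards) is that fence->ops becomes an __rcu pointer which the
core sets to NULL at signal time when neither a wait nor a release callback is
given, and that the ops owner only frees its data after an RCU grace period.
A reader would then look the ops up inside an RCU read-side critical section:

#include <linux/dma-fence.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>

/* Purely illustrative; today fence->ops is a plain const pointer. */
static void example_describe_fence(struct dma_fence *fence)
{
        const struct dma_fence_ops *ops;

        rcu_read_lock();
        ops = rcu_dereference(fence->ops);      /* assumes the __rcu annotation */
        if (ops)
                pr_info("fence %llu from %s\n", fence->seqno,
                        ops->get_driver_name(fence));
        else
                pr_info("fence %llu signaled, ops already detached\n",
                        fence->seqno);
        rcu_read_unlock();
}

If I've read that right, it is probably also the key point the documentation
should spell out: once the fence is signaled, the ops (and with them the
issuing module) may be gone, so anything that still dereferences them has to
do so under RCU.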