On Thu, Oct 30, 2025 at 11:59:01AM +0100, Christian König wrote:
> On 10/29/25 21:53, Matthew Brost wrote:
> > On Tue, Oct 28, 2025 at 03:06:22PM +0100, Christian König wrote:
> >> On 10/17/25 10:32, Philipp Stanner wrote:
> >>> On Tue, 2025-10-14 at 17:54 +0200, Christian König wrote:
> >>>> On 13.10.25 16:54, Philipp Stanner wrote:
> >>>>> On Mon, 2025-10-13 at 15:48 +0200, Christian König wrote:
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> dma_fences have always lived under the tyranny dictated by the module
> >>>>>> lifetime of their issuer, leading to crashes should anybody still hold
> >>>>>> a reference to a dma_fence when the module of the issuer is unloaded.
> >>>>>>
> >>>>>> But those days are over! The patch set following this mail finally
> >>>>>> implements a way for issuers to release their dma_fences from this
> >>>>>> slavery and let them outlive the module that originally created them.
> >>>>>>
> >>>>>> Previously, various approaches have been discussed, including changing
> >>>>>> the locking semantics of the dma_fence callbacks (by me) as well as
> >>>>>> using the drm scheduler as an intermediate layer (by Sima) to
> >>>>>> disconnect dma_fences from their actual users.
> >>>>>>
> >>>>>> Changing the locking semantics turned out to be much trickier than
> >>>>>> originally thought, because especially on older drivers (nouveau,
> >>>>>> radeon, but also i915) these locking semantics are actually needed for
> >>>>>> correct operation.
> >>>>>>
> >>>>>> Using the drm_scheduler as an intermediate layer is still a good idea
> >>>>>> and should probably be implemented to make life simpler for some
> >>>>>> drivers, but it doesn't work for all use cases. Especially TLB flush
> >>>>>> fences, preemption fences and userqueue fences don't go through the
> >>>>>> drm scheduler because it doesn't make sense for them.
> >>>>>>
> >>>>>> Tvrtko did some really nice prerequisite work by protecting the
> >>>>>> returned strings of the dma_fence_ops by RCU. This way dma_fence
> >>>>>> creators were able to just wait for an RCU grace period after fence
> >>>>>> signaling before they could safely free those data structures.
> >>>>>>
> >>>>>> Now this patch set goes a step further and protects the whole
> >>>>>> dma_fence_ops structure by RCU, so that after the fence signals the
> >>>>>> pointer to the dma_fence_ops is set to NULL when neither a wait nor a
> >>>>>> release callback is given. All functionality which uses the
> >>>>>> dma_fence_ops reference is put inside an RCU critical section, except
> >>>>>> for the deprecated issuer-specific wait and of course the optional
> >>>>>> release callback.
> >>>>>>
> >>>>>> In addition to the RCU changes, the lock protecting the dma_fence
> >>>>>> state previously had to be allocated externally. This set now changes
> >>>>>> the functionality to make that external lock optional and allows
> >>>>>> dma_fences to use an inline lock and be self-contained.
> >>>>>
> >>>>> Allowing for an embedded lock, is that actually necessary for the goals
> >>>>> of this series, or is it an optional change / improvement?
> >>>>
> >>>> It is kind of necessary because otherwise you can't fully determine the
> >>>> lifetime of the lock.
> >>>>
> >>>> The lock is, for example, used to prevent a dma_fence from signaling
> >>>> while you modify its linked list of callbacks.
> >>>>
> >>>> An alternative would be to protect the lock by RCU as well instead of
> >>>> embedding it in the structure, but that would make things even more
> >>>> complicated.
> >>>>
> >>>>> If I understood you correctly at XDC, you wanted to have an embedded
> >>>>> lock because it improves the memory footprint and because an external
> >>>>> lock couldn't achieve some of the originally intended goals regarding
> >>>>> fence-signaling order. Can you elaborate on that?
> >>>>
> >>>> The embedded lock is also nice to have for the dma_fence_array,
> >>>> dma_fence_chain and drm_sched_fence, but that just saves a few cache
> >>>> lines in some use cases.
> >>>>
> >>>> The fence-signaling order is important for drivers like radeon where
> >>>> the external lock protects multiple fences from signaling at the same
> >>>> time and makes sure that everything stays in order.
> >
> > Not to derail the conversation, but I noticed that dma-fence-arrays can,
> > in fact, signal out of order. The issue lies in the dma-fence-array
> > callback, which signals the fence using irq_work_queue(). Internally,
> > irq_work uses an llist, a LIFO structure. So, if two dma-fence-arrays
> > have all their fences signaled from a thread, the IRQ work that signals
> > each individual dma-fence-array will execute out of order.
> >
> > We should probably fix this.
>
> No, we shouldn't. That's what I've been trying to point out all along.
>
> The original idea of sharing the lock was to guarantee that fences signal
> in order, but that never worked correctly even for driver fences.
>
> The background is the optimization we do in the signaling fast path, e.g.
> when dma_fence_is_signaled() is called.
>

Ah, yes, I see this now. I was operating under the assumption that fences
on a timeline must signal in order, but that's not actually true. What is
true is that if a fence later on a timeline signals, all prior fences are
complete (i.e., the underlying hardware condition is met, even if the
software hasn't signaled them yet).

Could we document this somewhere in the dma-fence kernel docs? I can take a
stab at writing it up if you'd like. This is a fairly confusing aspect of
dma-fence behavior. (To check my understanding, I've put a rough sketch of
such a fast path, plus my reading of the new RCU scheme, at the bottom of
this mail.)

Matt

> This means that when fences A, B and C are submitted to the HW it is
> perfectly possible that somebody queries the status of fence B but not A
> and C. And this querying of the status is faster than the interrupt which
> signals A and C.
>
> So in this scenario B signals before A.
>
> The only way to avoid that is to not implement the fast path, and as far
> as I know no real HW driver does that because it makes your driver
> horribly slow.
>
> So off to the trash bin with the signaling order; things have worked for
> over 10 years without it and as far as I know nobody complained about it.
>
> Regards,
> Christian.
>
> >
> > Matt
> >
> >>>
> >>> I mean, neither external nor internal lock can somehow force the driver
> >>> to signal fences in order, can they?
> >>
> >> Nope, as I said before this approach is actually pretty useless.
> >>
> >>> Only the driver can ensure this.
> >>
> >> Only if the signaled callback is not implemented, and basically all
> >> drivers do implement it.
> >>
> >> So the whole point of sharing the lock just doesn't exist any more; it's
> >> just that changing it all at once, as I tried before, results in a way
> >> too big patch.
> >>>
> >>> I am, however, considering modeling something like that on a
> >>> FenceContext object:
> >>>
> >>> fctx.signal_all_fences_up_to_ordered(seqno);
> >>
> >> Yeah, I have patches for that as well. But then I found that amdgpu's
> >> TLB fences trigger that check and I won't have time to fix it.
> >>
> >>>
> >>> P.
> >>>
> >>>> While it is possible to change the locking semantics on such old
> >>>> drivers, it's probably just better to stay away from it.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>>
> >>>>> P.
> >>>>>
> >>>>>> The new approach is then applied to amdgpu, allowing the module to be
> >>>>>> unloaded even when dma_fences issued by it are still around.
> >>>>>>
> >>>>>> Please review and comment,
> >>>>>> Christian.
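
P.S.: Here is the fast-path sketch I mentioned above, just to check that I
understand the scenario. Everything below is made up (my_fence, my_ring and
friends are not taken from any real driver); it only shows the usual pattern
of a ->signaled() callback that polls a seqno the hardware writes back, so
whichever fence happens to be queried first is the first one marked signaled
in software, regardless of submission order:

#include <linux/dma-fence.h>
#include <linux/io.h>
#include <linux/kernel.h>

struct my_ring {
        u32 __iomem *hw_seqno;  /* last seqno the HW has written back */
};

struct my_fence {
        struct dma_fence base;
        struct my_ring *ring;
};

static const char *my_fence_get_driver_name(struct dma_fence *f)
{
        return "my_driver";
}

static const char *my_fence_get_timeline_name(struct dma_fence *f)
{
        return "my_ring";
}

/*
 * Fast path: poll the HW write-back instead of waiting for the completion
 * interrupt. If fence B is queried here before the interrupt for A and C
 * has fired, B is the first fence to be marked signaled in software, even
 * though A completed on the hardware earlier.
 */
static bool my_fence_signaled(struct dma_fence *f)
{
        struct my_fence *fence = container_of(f, struct my_fence, base);
        u32 hw_seqno = readl(fence->ring->hw_seqno);

        return (s32)(hw_seqno - lower_32_bits(f->seqno)) >= 0;
}

static const struct dma_fence_ops my_fence_ops = {
        .get_driver_name = my_fence_get_driver_name,
        .get_timeline_name = my_fence_get_timeline_name,
        .signaled = my_fence_signaled,
};

If that matches what the drivers effectively do, then dma_fence_is_signaled(B)
can flip B to signaled while A is still waiting for its completion interrupt,
which is exactly the scenario you describe above.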

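P.P.S.: And here is my, possibly wrong, reading of the RCU scheme from the
cover letter, again with made-up names. The big assumption (please correct me
if I got it backwards) is that fence->ops becomes an __rcu pointer which the
core sets to NULL at signal time when neither a wait nor a release callback is
given, and that the ops owner only frees its data after an RCU grace period.
A reader would then look the ops up inside an RCU read-side critical section:

#include <linux/dma-fence.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>

/* Purely illustrative; today fence->ops is a plain const pointer. */
static void example_describe_fence(struct dma_fence *fence)
{
        const struct dma_fence_ops *ops;

        rcu_read_lock();
        ops = rcu_dereference(fence->ops);      /* assumes the __rcu annotation */
        if (ops)
                pr_info("fence %llu from %s\n", fence->seqno,
                        ops->get_driver_name(fence));
        else
                pr_info("fence %llu signaled, ops already detached\n",
                        fence->seqno);
        rcu_read_unlock();
}

If I've read that right, it is probably also the key point the documentation
should spell out: once the fence is signaled, the ops (and with them the
issuing module) may be gone, so anything that still dereferences them has to
do so under RCU.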