On Tue, Oct 28, 2025 at 03:06:22PM +0100, Christian König wrote:
> On 10/17/25 10:32, Philipp Stanner wrote:
> > On Tue, 2025-10-14 at 17:54 +0200, Christian König wrote:
> >> On 13.10.25 16:54, Philipp Stanner wrote:
> >>> On Mon, 2025-10-13 at 15:48 +0200, Christian König wrote:
> >>>> Hi everyone,
> >>>>
> >>>> dma_fences have always lived under the tyranny dictated by the
> >>>> module lifetime of their issuer, leading to crashes should anybody
> >>>> still hold a reference to a dma_fence when the issuer's module was
> >>>> unloaded.
> >>>>
> >>>> But those days are over! The patch set following this mail finally
> >>>> implements a way for issuers to release their dma_fences from this
> >>>> slavery and let them outlive the module which originally created
> >>>> them.
> >>>>
> >>>> Previously, various approaches were discussed, including changing
> >>>> the locking semantics of the dma_fence callbacks (by me) as well as
> >>>> using the drm scheduler as an intermediate layer (by Sima) to
> >>>> disconnect dma_fences from their actual users.
> >>>>
> >>>> Changing the locking semantics turned out to be much trickier than
> >>>> originally thought, because especially on older drivers (nouveau,
> >>>> radeon, but also i915) the current locking semantics are actually
> >>>> needed for correct operation.
> >>>>
> >>>> Using the drm_scheduler as an intermediate layer is still a good
> >>>> idea and should probably be implemented to make life simpler for
> >>>> some drivers, but it doesn't work for all use cases. In particular,
> >>>> TLB flush fences, preemption fences and userqueue fences don't go
> >>>> through the drm scheduler because it doesn't make sense for them.
> >>>>
> >>>> Tvrtko did some really nice prerequisite work by protecting the
> >>>> strings returned by the dma_fence_ops with RCU. This way dma_fence
> >>>> creators were able to just wait for an RCU grace period after fence
> >>>> signaling before it was safe to free those data structures.
> >>>>
> >>>> Now this patch set goes a step further and protects the whole
> >>>> dma_fence_ops structure with RCU, so that after the fence signals,
> >>>> the pointer to the dma_fence_ops is set to NULL when neither a wait
> >>>> nor a release callback is given. All functionality which uses the
> >>>> dma_fence_ops reference is put inside an RCU critical section,
> >>>> except for the deprecated issuer-specific wait and of course the
> >>>> optional release callback.
> >>>>
> >>>> In addition to the RCU changes, the lock protecting the dma_fence
> >>>> state previously had to be allocated externally. This set now makes
> >>>> that external lock optional and allows dma_fences to use an inline
> >>>> lock and be self-contained.
> >>>
> >>> Allowing for an embedded lock, is that actually necessary for the
> >>> goals of this series, or is it an optional change / improvement?
> >>
> >> It is kind of necessary because otherwise you can't fully determine
> >> the lifetime of the lock.
> >>
> >> The lock is used to avoid signaling a dma_fence while you modify the
> >> linked list of callbacks, for example.
> >>
> >> An alternative would be to protect the lock by RCU as well instead of
> >> embedding it in the structure, but that would make things even more
> >> complicated.
> >>
> >>> If I understood you correctly at XDC, you wanted to have an embedded
> >>> lock because it improves the memory footprint and because an
> >>> external lock couldn't achieve some goals about fence-signaling
> >>> order originally intended.
> >>> Can you elaborate on that?
> >>
> >> The embedded lock is also nice to have for dma_fence_array,
> >> dma_fence_chain and drm_sched_fence, but that just saves a few cache
> >> lines in some use cases.
> >>
> >> The fence-signaling order is important for drivers like radeon, where
> >> the external lock protects multiple fences from signaling at the same
> >> time and makes sure that everything stays in order.
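
For my own understanding of the ops lifetime part: the way I read the cover
letter, every consumer-side ops access would end up looking roughly like the
sketch below once fence->ops becomes RCU-protected and is cleared on
signaling. This is purely illustrative -- the NULL convention is taken from
the description above, not from the actual patches, and today's fence->ops
is not __rcu annotated, so sparse would complain about this as-is:

#include <linux/dma-fence.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>

/* Illustrative only -- assumes fence->ops has become an RCU-protected
 * pointer which the core sets to NULL once the fence is signaled and
 * neither a wait nor a release callback is needed anymore. */
static void example_print_timeline(struct dma_fence *fence)
{
	const struct dma_fence_ops *ops;

	rcu_read_lock();
	ops = rcu_dereference(fence->ops);
	if (ops)
		pr_info("timeline: %s\n", ops->get_timeline_name(fence));
	else
		pr_info("fence signaled, ops already dropped\n");
	rcu_read_unlock();
}

/* The issuer then only has to wait for an RCU grace period after its
 * last fence signaled (e.g. synchronize_rcu() in its module_exit path)
 * before the ops structure and the module text can go away. */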
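
And on the external lock, for reference, this is roughly the contract every
issuer has to follow today: the spinlock passed to dma_fence_init() is
external to the fence and must outlive it, which is exactly the lifetime
problem. The driver structures here are made up for illustration; only the
dma_fence_init() call itself is the real, current API:

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Hypothetical issuer: one shared lock per timeline, as many drivers do. */
struct toy_timeline {
	spinlock_t lock;	/* shared by all fences on this timeline */
	u64 context;
	u64 seqno;
};

struct toy_fence {
	struct dma_fence base;
};

static const char *toy_driver_name(struct dma_fence *f)
{
	return "toy";
}

static const char *toy_timeline_name(struct dma_fence *f)
{
	return "toy-timeline";
}

static const struct dma_fence_ops toy_fence_ops = {
	.get_driver_name = toy_driver_name,
	.get_timeline_name = toy_timeline_name,
};

static struct dma_fence *toy_fence_create(struct toy_timeline *tl)
{
	struct toy_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;

	/*
	 * The lock lives in the timeline, not in the fence: the fence,
	 * and thus anyone still holding a reference to it, now depends
	 * on the timeline (and the module) staying alive. As I read the
	 * series, this external lock becomes optional so the fence can
	 * be fully self-contained.
	 */
	dma_fence_init(&f->base, &toy_fence_ops, &tl->lock,
		       tl->context, ++tl->seqno);
	return &f->base;
}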

Not to derail the conversation, but I noticed that dma-fence-arrays can, in
fact, signal out of order. The issue lies in the dma-fence-array callback,
which signals the array using irq_work_queue(). Internally, irq_work uses an
llist, which is a LIFO structure. So, if two dma-fence-arrays have all their
fences signaled from the same thread, the irq_work items that signal the
individual dma-fence-arrays will execute out of order. We should probably
fix this. (A minimal reproducer sketch of the underlying irq_work behaviour
is at the end of this mail.)

Matt

> > I mean, neither an external nor an internal lock can somehow force the
> > driver to signal fences in order, can they?
>
> Nope, as I said before this approach is actually pretty useless.
>
> > Only the driver can ensure this.
>
> Only when the signaled callback is not implemented, which is basically
> what all drivers do.
>
> So the whole point of sharing the lock just doesn't exist any more; it's
> just that changing it all at once, as I tried before, results in way too
> big a patch.
>
> > I am, however, considering modeling something like that on a
> > FenceContext object:
> >
> > fctx.signal_all_fences_up_to_ordered(seqno);
>
> Yeah, I have patches for that as well. But I then found that amdgpu's TLB
> fences trigger that check, and I won't have time to fix it.
>
> > P.
>
> >> While it is possible to change the locking semantics on such old
> >> drivers, it's probably just better to stay away from it.
> >>
> >> Regards,
> >> Christian.
>
> >>> P.
>
> >>>> The new approach is then applied to amdgpu, allowing the module to
> >>>> be unloaded even when dma_fences issued by it are still around.
> >>>>
> >>>> Please review and comment,
> >>>> Christian.
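
Coming back to the out-of-order signaling I mentioned above: here is the
promised minimal (hypothetical, untested) module sketch that only exercises
irq_work itself. It queues two items back to back from process context and
the second one typically runs first, because irq_work pushes onto a per-CPU
llist and processes it head first -- the same thing happens to the per-array
irq_work when two dma-fence-arrays become complete from the same thread:

#include <linux/irq_work.h>
#include <linux/module.h>
#include <linux/printk.h>

static void work_a_func(struct irq_work *work)
{
	pr_info("irq_work A ran\n");
}

static void work_b_func(struct irq_work *work)
{
	pr_info("irq_work B ran\n");
}

static DEFINE_IRQ_WORK(work_a, work_a_func);
static DEFINE_IRQ_WORK(work_b, work_b_func);

static int __init lifo_demo_init(void)
{
	/* Queued in the order A, B ... */
	irq_work_queue(&work_a);
	irq_work_queue(&work_b);
	/*
	 * ... but both land on the same per-CPU llist, which is drained
	 * head first, so B's handler typically runs before A's (LIFO).
	 */
	return 0;
}

static void __exit lifo_demo_exit(void)
{
	irq_work_sync(&work_a);
	irq_work_sync(&work_b);
}

module_init(lifo_demo_init);
module_exit(lifo_demo_exit);
MODULE_LICENSE("GPL");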

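And regarding the fctx.signal_all_fences_up_to_ordered(seqno) idea quoted
above: as far as I know nothing like that exists today, but I'd imagine it
as a driver-side helper roughly along these lines. Everything here,
including the context structure and its pending list, is made up purely for
illustration of the ordering idea:

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical per-context bookkeeping: pending fences kept on a list
 * in seqno order, protected by a single context lock. */
struct toy_fence_ctx {
	spinlock_t lock;
	struct list_head pending;	/* struct toy_ctx_fence, seqno order */
	u64 last_signaled;
};

struct toy_ctx_fence {
	struct dma_fence base;
	struct list_head node;
};

/* Signal every pending fence with seqno <= @seqno, strictly in seqno
 * order, and remember the highest seqno signaled so far. */
static void toy_ctx_signal_up_to(struct toy_fence_ctx *ctx, u64 seqno)
{
	struct toy_ctx_fence *f, *tmp;
	LIST_HEAD(done);

	spin_lock(&ctx->lock);
	list_for_each_entry_safe(f, tmp, &ctx->pending, node) {
		if (f->base.seqno > seqno)
			break;
		list_move_tail(&f->node, &done);
	}
	if (seqno > ctx->last_signaled)
		ctx->last_signaled = seqno;
	spin_unlock(&ctx->lock);

	/* Signal outside the context lock, but still in seqno order. */
	list_for_each_entry_safe(f, tmp, &done, node) {
		list_del(&f->node);
		dma_fence_signal(&f->base);
		dma_fence_put(&f->base);	/* drop the list's reference */
	}
}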