On Tue, Oct 28, 2025 at 03:06:22PM +0100, Christian König wrote:
> On 10/17/25 10:32, Philipp Stanner wrote:
> > On Tue, 2025-10-14 at 17:54 +0200, Christian König wrote:
> >> On 13.10.25 16:54, Philipp Stanner wrote:
> >>> On Mon, 2025-10-13 at 15:48 +0200, Christian König wrote:
> >>>> Hi everyone,
> >>>>
> >>>> dma_fences have always lived under the tyranny dictated by the module
> >>>> lifetime of their issuer, leading to crashes should anybody still hold
> >>>> a reference to a dma_fence when the module of the issuer was unloaded.
> >>>>
> >>>> But those days are over! The patch set following this mail finally
> >>>> implements a way for issuers to release their dma_fences from this
> >>>> slavery and let them outlive the module which originally created them.
> >>>>
> >>>> Previously, various approaches have been discussed, including changing
> >>>> the locking semantics of the dma_fence callbacks (by me) as well as using
> >>>> the drm scheduler as an intermediate layer (by Sima) to disconnect
> >>>> dma_fences from their actual users.
> >>>>
> >>>> Changing the locking semantics turned out to be much trickier than
> >>>> originally thought because especially on older drivers (nouveau, radeon,
> >>>> but also i915) the current locking semantics are actually needed for
> >>>> correct operation.
> >>>>
> >>>> Using the drm_scheduler as an intermediate layer is still a good idea and
> >>>> should probably be implemented to make life simpler for some drivers, but
> >>>> it doesn't work for all use cases. Especially TLB flush fences, preemption
> >>>> fences and userqueue fences don't go through the drm scheduler because it
> >>>> doesn't make sense for them.
> >>>>
> >>>> Tvrtko did some really nice prerequisite work by protecting the strings
> >>>> returned by the dma_fence_ops with RCU. This way dma_fence creators were
> >>>> able to just wait for an RCU grace period after fence signaling before it
> >>>> was safe to free those data structures.
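
As an aside for anyone following along, that pattern boils down to
something like the sketch below (the foo_* names are made up, this is
not code from Tvrtko's patches):

  #include <linux/dma-fence.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct foo_fence {
          struct dma_fence base;
          const char *timeline_name;      /* returned by .get_timeline_name */
  };

  static void foo_fence_retire(struct foo_fence *f)
  {
          dma_fence_signal(&f->base);

          /*
           * Readers may still be inside an RCU read-side section looking
           * at the strings returned by the dma_fence_ops; wait for them
           * before the backing storage goes away.
           */
          synchronize_rcu();
          kfree(f->timeline_name);
  }
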
> >>>>
> >>>> Now this patch set goes a step further and protects the whole
> >>>> dma_fence_ops structure with RCU, so that after the fence signals the
> >>>> pointer to the dma_fence_ops is set to NULL when neither a wait nor a
> >>>> release callback is given. All functionality which uses the dma_fence_ops
> >>>> reference is put inside an RCU critical section, except for the
> >>>> deprecated issuer specific wait and of course the optional release
> >>>> callback.
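
If I read that correctly, the read side then ends up looking roughly
like the sketch below (helper name made up, and fence->ops becoming an
RCU-protected pointer is my paraphrase of the series, not its literal
code):

  #include <linux/dma-fence.h>
  #include <linux/printk.h>
  #include <linux/rcupdate.h>

  static void fence_print_driver(struct dma_fence *fence)
  {
          const struct dma_fence_ops *ops;

          rcu_read_lock();
          ops = rcu_dereference(fence->ops);
          if (ops)
                  pr_info("fence issued by %s\n", ops->get_driver_name(fence));
          else
                  pr_info("fence signaled, issuer may already be gone\n");
          rcu_read_unlock();
  }
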
> >>>>
> >>>> In addition to the RCU changes: the lock protecting the dma_fence state
> >>>> previously had to be allocated externally. This set now makes that
> >>>> external lock optional and allows dma_fences to use an inline lock and be
> >>>> self-contained.
> >>>
> >>> Allowing for an embedded lock, is that actually necessary for the goals
> >>> of this series, or is it an optional change / improvement?
> >>
> >> It is kind of necessary because otherwise you can't fully determine the 
> >> lifetime of the lock.
> >>
> >> The lock is used to avoid signaling a dma_fence when you modify the linked 
> >> list of callbacks for example.
> >>
> >> An alternative would be to protect the lock by RCU as well instead of 
> >> embedding it in the structure, but that would make things even more 
> >> complicated.
> >>
> >>> If I understood you correctly at XDC you wanted to have an embedded
> >>> lock because it improves the memory footprint and because an external
> >>> lock couldn't achieve some goals about fence-signaling-order originally
> >>> intended. Can you elaborate on that?
> >>
> >> The embedded lock is also nice to have for the dma_fence_array, 
> >> dma_fence_chain and drm_sched_fence, but that just saves a few cache lines 
> >> in some use cases.
> >>
> >> The fence-signaling-order is important for drivers like radeon where the 
> >> external lock is protecting multiple fences from signaling at the same 
> >> time and makes sure that everything stays in order.

Not to derail the conversation, but I noticed that dma-fence-arrays can,
in fact, signal out of order. The issue lies in the dma-fence-array
callback, which signals the array fence via irq_work_queue(). Internally,
irq_work uses an llist, a LIFO structure. So, if two dma-fence-arrays
have all their fences signaled from the same thread, the IRQ work items
that signal each individual dma-fence-array will execute out of order.

We should probably fix this.
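
To illustrate the LIFO behavior (made-up names, this is not the
dma-fence-array code itself, just the llist pattern irq_work uses
internally):

  #include <linux/llist.h>
  #include <linux/printk.h>
  #include <linux/types.h>

  struct pending_signal {
          struct llist_node node;
          u64 context;
  };

  static LLIST_HEAD(signal_list);

  /* Producer: array A completes first, array B second. */
  static void queue_completions(struct pending_signal *a,
                                struct pending_signal *b)
  {
          llist_add(&a->node, &signal_list);      /* A at the head */
          llist_add(&b->node, &signal_list);      /* B now ahead of A */
  }

  /* Consumer, i.e. what the irq_work run loop effectively does. */
  static void run_completions(void)
  {
          struct llist_node *list = llist_del_all(&signal_list);
          struct pending_signal *ps;

          /* Walks from the head, so B is processed before A. */
          llist_for_each_entry(ps, list, node)
                  pr_info("signaling fence context %llu\n",
                          (unsigned long long)ps->context);
  }
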

Matt

> > 
> > I mean, neither an external nor an internal lock can somehow force the
> > driver to signal fences in order, can they?
> 
> Nope, as I said before, this approach is actually pretty useless.
> 
> > Only the driver can ensure this.
> 
> Only when the signaled callback is not implemented, which basically all
> drivers do.
> 
> So the whole point of sharing the lock just doesn't exist any more; it's
> just that changing it all at once, as I tried before, results in a way too
> big patch.
> 
> > 
> > I am, however, considering modeling something like that on a
> > FenceContext object:
> > 
> > fctx.signal_all_fences_up_to_ordered(seqno);
> 
> Yeah, I have patches for that as well. But then I found that amdgpu's TLB
> fences trigger that check, and I won't have time to fix it.
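
For what it's worth, my mental model of Philipp's idea is something like
the sketch below (entirely hypothetical, not an existing API and not
Christian's patches; the context keeps its unsignaled fences in
submission order and signals everything up to and including @seqno):

  #include <linux/dma-fence.h>
  #include <linux/list.h>
  #include <linux/spinlock.h>

  struct fence_ctx {
          spinlock_t lock;
          struct list_head pending;       /* ctx_fence list, seqno order */
  };

  struct ctx_fence {
          struct dma_fence base;
          struct list_head link;
  };

  static void fence_ctx_signal_up_to(struct fence_ctx *fctx, u64 seqno)
  {
          struct ctx_fence *cf, *tmp;
          unsigned long flags;

          spin_lock_irqsave(&fctx->lock, flags);
          list_for_each_entry_safe(cf, tmp, &fctx->pending, link) {
                  if (cf->base.seqno > seqno)
                          break;  /* list is ordered, nothing older left */
                  list_del_init(&cf->link);
                  dma_fence_signal(&cf->base);
                  dma_fence_put(&cf->base);       /* drop the list's reference */
          }
          spin_unlock_irqrestore(&fctx->lock, flags);
  }
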
> 
> 
> 
> > 
> > 
> > P.
> > 
> >>
> >> While it is possible to change the locking semantics on such old drivers, 
> >> it's probably just better to stay away from it.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> P.
> >>>
> >>>
> >>>>
> >>>> The new approach is then applied to amdgpu allowing the module to be
> >>>> unloaded even when dma_fences issued by it are still around.
> >>>>
> >>>> Please review and comment,
> >>>> Christian.
> >>>>
> >>>
> >>
> > 
> 
