On Wed, 18 Feb 2026 09:50:56 +0000
Alice Ryhl <[email protected]> wrote:

> On Tue, Feb 17, 2026 at 03:28:06PM +0100, Philipp Stanner wrote:
> > On Tue, 2026-02-17 at 15:22 +0100, Christian König wrote:  
> > > On 2/17/26 15:09, Alice Ryhl wrote:  
> > > > On Tue, Feb 17, 2026 at 3:04 PM Philipp Stanner <[email protected]> 
> > > > wrote:  
> > > > > > > > 
> > > > > > > >   
> > 
> > […]
> >   
> > > > > > > > Thinking more about it you should probably enforce that there 
> > > > > > > > is only
> > > > > > > > one signaling path for each fence signaling.  
> > > > > > > 
> > > > > > > I'm not really convinced by this.
> > > > > > > 
> > > > > > > First, the timeout path must be a fence signalling path because 
> > > > > > > the
> > > > > > > reason you have a timeout in the first place is because the hw 
> > > > > > > might
> > > > > > > never signal the fence. So if the timeout path deadlocks on a
> > > > > > > kmalloc(GFP_KERNEL) and the hw never comes around to wake you up, 
> > > > > > > boom.  
> > > > > > 
> > > > > > Mhm, good point. On the other hand the timeout handling should 
> > > > > > probably be considered part of the normal signaling path.  
> > > > > 
> > > > > 
> > > > > Why would anyone want to allocate in a timeout path in the first 
> > > > > place – especially for jobqueue?
> > > > > 
> > > > > Timeout -> close the associated ring. Done.
> > > > > JobQueue will signal the done_fences with -ECANCELED.
> > > > > 
> > > > > What would the driver want to allocate in its timeout path, i.e.: 
> > > > > timeout callback.  
> > > > 
> > > > Maybe you need an allocation to hold the struct delayed_work_struct
> > > > field that you use to enqueue the timeout?  
> > > 
> > > And the workqueue were you schedule the delayed_work on must have the 
> > > reclaim bit set.
> > > 
> > > Otherwise it can be that the workqueue finds all kthreads busy and tries 
> > > to start a new one, e.g. allocating task structure......  
> > 
> > OK, maybe I'm lost, but what delayed_work?
> > 
> > The jobqueue's delayed work item gets either created on JQ::new() or in
> > jq.submit_job(). Why would anyone – that is: any driver – implement a
> > delayed work in its timeout callback?
> > 
> > That doesn't make sense.
> > 
> > JQ notifies the driver from its delayed_work through
> > timeout_callback(), and in that callback the driver closes the
> > associated firmware ring.
> > 
> > And it drops the JQ. So it is gone. A new JQ will get a new timeout
> > work item.
> > 
> > That's basically all the driver must ever do. Maybe some logging and
> > stuff.
> > 
> > With firmware scheduling it should really be that simple.
> > 
> > And signalling / notifying userspace gets done by jobqueue.
> > 
> > Right?  
> 
> What I'm getting at is that a driver author might attempt to implement
> their own timeout logic instead of using the job queue, and if they do,
> they might get it wrong in the way I described.
> 
> You're correct that they shouldn't do this. But you asked how a driver
> author might get the timeout wrong, and doing it the wrong way is one
> such way they might do it in the wrong way.

Are we back to discussing "how to ensure nothing prohibited happens in
the DMA signalling path?" or is this something else? I mean, I'm
convinced timeout handling should be part of the DMA-signalling path,
no matter if it's in common/well-audited code like JobQueue, or some
custom driver timeout handling (which I'm not advocating for, just to
be clear). As such, I believe we should ensure XxxDmaFence::signal()
(I'm using Xxx because the name of this Signal-able object is still
undecided AFAIK :-)) is called inside a DMA-signalling section. Note
that dma_fence_signal() declares a signalling section before signaling,
so this check would have to be done before calling dma_fence_signal() in
the XxxDmaFence::signal() implementation.

If we go this way, with

- DmaFenceWorkqueue+DmaFenceWork: generic abstractions for DMA-fence
  constrained works
- DmaFenceThreadedHandler: generic abstraction for a DMA-fence
  constrained threaded IRQ handler (raw IRQ handlers are already more
  constrained than the DMA-fence signalling path, so we don't care)
- and potentially other helpers for other kind of deferred signalling

we should be covered.

I believe that covers the case Alice was describing where the driver
allocates a DelayedWork with GFP_KERNEL in a DMA-signalling path, which
is prohibited. So yeah, if a driver decides to go for its own watchdog
implementation signalling all fences manually, it will just be
constrained by the same rules. XxxDmaFence::signal() will yell at you
if you're not in an annotated DMA-fence signalling section, and if you
are and you're doing something prohibited it will also yell at you. The
only way to abuse this is if rust code decides to manually annotate a
section, which we can flag as unsafe to make it clear this is not
something you should play with unless you're well aware of the risks.

Am I missing something?

Reply via email to