On 6/8/26 18:16, Boris Brezillon wrote:
> On Mon, 08 Jun 2026 17:30:58 +0200
> Philipp Stanner <[email protected]> wrote:
> 
>> On Mon, 2026-06-08 at 17:23 +0200, Danilo Krummrich wrote:
>>> On Mon Jun 8, 2026 at 5:17 PM CEST, Philipp Stanner wrote:  
>>>> On Mon, 2026-06-08 at 17:01 +0200, Boris Brezillon wrote:  
>>>>> On Mon,  8 Jun 2026 16:24:37 +0200
>>>>> Philipp Stanner <[email protected]> wrote:
>>>>>   
>>>>>> @@ -1020,11 +1024,20 @@ EXPORT_SYMBOL(dma_fence_wait_any_timeout);
>>>>>>  void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
>>>>>>  {
>>>>>>          const struct dma_fence_ops *ops;
>>>>>> +        unsigned long flags;
>>>>>>  
>>>>>>          rcu_read_lock();
>>>>>>          ops = rcu_dereference(fence->ops);
>>>>>> -        if (ops && ops->set_deadline && !dma_fence_is_signaled(fence))
>>>>>> +        if (!ops || !ops->set_deadline) {
>>>>>> +                rcu_read_unlock();
>>>>>> +                return;
>>>>>> +        }
>>>>>> +
>>>>>> +        dma_fence_lock_irqsave(fence, flags);
>>>>>> +        if (!dma_fence_is_signaled_locked(fence))
>>>>>>                  ops->set_deadline(fence, deadline);  
>>>>>
>>>>> You can't take the fence lock around ->set_deadline(), otherwise you'll
>>>>> deadlock here [1] or here [2].
>>>>>   
>>>>>> +
>>>>>> +        dma_fence_unlock_irqrestore(fence, flags);
>>>>>>          rcu_read_unlock();
>>>>>>  }  
>>>>>
>>>>>
>>>>> [1]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/dma-buf/sw_sync.c#L182
>>>>> [2]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/gpu/drm/msm/msm_fence.c#L139
>>>>>   
>>
>> Oh, MSM actually doesn't btw, that's a false positive. That's a
>> distinct spinlock on their fence context object.
> 
> It's not, it's the same lock they attach to all their fences coming
> from this context. It's just that this lock appears to be per-context,
> like is the case for basically every driver, since the inline lock was
> introduced only in this release cycle.
> 
> Anyway, this is all stuff we can fix if people think it's okay to
> protect dma_fence_ops calls with the fence lock. But my point remains:
> each op has its own locking-rules, some are called with the fence lock
> held (enable_signaling(), signaled()), others are not (set_deadline(),
> get_xxx_name()), so we need to carefully audit each of those to make
> sure:
> 
> - calling with the lock held in the new places is not causing a
>   deadlock

Yeah, exactly that.

For example the set_deadline() is intentionally not called with the fence lock 
held because that won't work for some use cases.

The problem when you call ops with a lock held is always that this lock then 
becomes the outermost look held. In other words when you for example want to 
grab a power management lock to implement the deadline feature the framework 
enforces an order between the two locks which isn't desired.

Regards,
Christian.


> - the returned data, if not a scalar, is protected by the RCU read lock
> - any driver implementing ops that can be called without the lock held
>   need to hold on the device data for an RCU grace period
> 
> The last bullet is probably the one I'm the most worried about, because
> instead of a single rule that applies to all ops, we have various cases
> based on whether some ops are implemented or not, but that's already
> the case with deprecated ops like .release() or .wait(), so maybe
> that's okay with the proper doc.
> 
> If I were to choose, I'd probably go for a dedicated rwlock_t to
> protect dma_fence_ops, so we can:
> 
> - protect all dma_buf_ops::xx() consistently no matter the kind of op
> - protect returned data (get_xxx_name()) with this lock instead of the
>   RCU read lock
> 
> But the overhead of this extra lock might not be acceptable, dunno.
> 
>>
>>
>> But yes, before we could upstream this, we would go through all the
>> implementors like Danilo did, to find all the others.
> 
> There's the two I pointed out, plus the array/chain containers I
> mentioned, which are not problematic.

Reply via email to