Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT

Joel Fernandes Thu, 19 Mar 2026 11:43:18 -0700

On 3/19/2026 1:44 PM, Boqun Feng wrote:
> On Thu, Mar 19, 2026 at 06:02:44PM +0100, Sebastian Andrzej Siewior wrote:
>> On 2026-03-19 09:48:16 [-0700], Boqun Feng wrote:
>>> I agree it's not RCU's fault ;-)
>>
>> I never claimed it is anyone's fault. I just see that BPF should be able
>> to do things which kgdb would not be allowed to.
>>
>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call
>>> call_srcu() in irq_work instead? Or a more systematic defer mechanism
>>> that allows BPF to defer any lock holding functions to a different
>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in
>>> some cases IIRC).
>>>
>>> But we need to fix this in v7.0, so this short-term fix is still needed.
>>
>> I would prefer something substantial before we rush to get a quick fix
>> and move on.
>>
> 
> The quick fix here is really "restore the previous behavior of
> call_rcu_tasks_trace() in call_srcu()", and the future work will


Unfortunately reverting c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in
terms of SRCU-fast") is tricky since the original body of RCU Tasks Trace code
has been deleted. Perhaps we should have added an easier escape hatch; lesson learnt :)

> naturally happen: if the extra irq_work layer turns out causing issues
> to other SRCU users, then we need to fix them as well. Otherwise, there
> is no real need to avoid the extra irq_work hop. So I *think* it's OK
> ;-)
> 
> Cleaning up all the ad-hoc irq_work usages in BPF is another thing,
> which can happen if we learn about all the cases and have a good design.
> 
>> If we could get that irq_work() part only for BPF where it is required
>> then it would be already a step forward.
>>
> 
> I'm happy to include that (i.e. using Qiang's suggestion) if Joel also
> agrees.

Sure, I am OK with this sort of short-term fix, but I worry that it still does
not fix the issues due to the tasks-trace conversion. In particular, it doesn't
fix the issue Andrea reported AFAICS, because there is a dependency on
pool->lock? See:
https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/

That happens precisely because queue_delayed_work() is called from the SRCU
path that BPF now reaches through tasks-trace, right?

The cycle looks something like this, due to the combination of SRCU, scheduler
and WQ:

srcu_usage.lock -> pool->lock -> pi_lock -> rq->__lock
       ^                                       |
       |                                       |
       +----------- DEADLOCK CYCLE ------------+
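
To make the cycle above concrete, here is a rough userspace sketch (not kernel
code, and not how lockdep is implemented) that records "B acquired while A is
held" edges and checks whether a new acquisition would close a cycle. The enum
names only mirror the diagram; lock_graph, record_order() and would_deadlock()
are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy lock classes named after the diagram; nothing here is a kernel API. */
enum lock_class { SRCU_USAGE_LOCK, POOL_LOCK, PI_LOCK, RQ_LOCK, NR_LOCKS };

/* edge[a][b] == true means "b was acquired while a was held". */
struct lock_graph {
	bool edge[NR_LOCKS][NR_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record one observed ordering: 'acquired' taken with 'held' already held. */
static void record_order(struct lock_graph *g, enum lock_class held,
			 enum lock_class acquired)
{
	assert(held < NR_LOCKS && acquired < NR_LOCKS);
	g->edge[held][acquired] = true;
}

/* Depth-first search: is 'target' reachable from 'from' via recorded edges? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *seen)
{
	if (from == target)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reaches(g, next, target, seen))
			return true;
	}
	return false;
}

/*
 * Taking 'acquired' while holding 'held' closes a cycle if 'held' is
 * already reachable from 'acquired' through earlier orderings.
 */
static bool would_deadlock(const struct lock_graph *g, enum lock_class held,
			   enum lock_class acquired)
{
	bool seen[NR_LOCKS] = { false };

	return reaches(g, acquired, held, seen);
}
```

With the three forward edges from the diagram recorded, asking about
srcu_usage.lock being taken under rq->__lock (the BPF-under-rq-lock case)
reports a cycle, while the reverse query does not.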

>> Long term it would be nice if we could avoid calling this while locks
>> are held. I think call_rcu() can't be used under rq/pi lock, but timers
>> should be fine.
>>
>> Is this rq/pi locking originating from "regular" BPF code or sched_ext?
>>
> 
> I think if you have any tracepoint (including traceable functions) under
> rq/pi locking, then potentially BPF can call call_srcu() there.

> 
> The root cause of the issues is that BPF is actually like a NMI unless
> the code is noinstr (There is a rabbit hole about BPF calling
> call_srcu() while it's instrumenting call_srcu() itself). And the right
> way to solve all the issues is to have a general defer mechanism for
> BPF.

Will that really solve the above-mentioned issue that Andrea reported, though?
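
Just so we mean the same thing by "defer", below is a rough userspace sketch of
the pattern as I understand it: the restricted context only records the
request, and a later, safer context runs it. This is an assumption-laden toy,
not kernel code; defer_queue, defer_call(), run_deferred() and fake_locked_op()
are hypothetical names, and fake_locked_op() merely stands in for a
lock-taking function such as call_srcu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_SLOTS 8	/* arbitrary capacity chosen for the sketch */

typedef void (*deferred_fn)(void *arg);

struct deferred_call {
	deferred_fn fn;
	void *arg;
};

struct defer_queue {
	struct deferred_call slot[DEFER_SLOTS];
	size_t count;
};

/*
 * Called from the restricted context: records the request and returns.
 * It takes no locks and never runs 'fn' itself.
 */
static bool defer_call(struct defer_queue *q, deferred_fn fn, void *arg)
{
	assert(fn != NULL);
	if (q->count == DEFER_SLOTS)
		return false;	/* full: the caller has to cope with this */
	q->slot[q->count].fn = fn;
	q->slot[q->count].arg = arg;
	q->count++;
	return true;
}

/*
 * Called later, from a context where lock-taking functions are allowed.
 * Runs every recorded request in order and returns how many were run.
 * Not reentrant: callbacks must not queue more work in this toy version.
 */
static size_t run_deferred(struct defer_queue *q)
{
	size_t ran = 0;

	for (size_t i = 0; i < q->count; i++) {
		q->slot[i].fn(q->slot[i].arg);
		ran++;
	}
	q->count = 0;
	return ran;
}

/* Stand-in for a lock-taking operation; it only counts invocations. */
static void fake_locked_op(void *arg)
{
	int *calls = arg;

	(*calls)++;
}
```

The sketch only shows the call leaving the restricted region; it says nothing
about which locks the deferred context itself ends up taking.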

+Andrea, +Steve as well.

thanks,

--
Joel Fernandes
