On 3/19/2026 4:14 PM, Boqun Feng wrote:
> On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote:
>> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <[email protected]> wrote:
>>>
>>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote:
>>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <[email protected]> wrote:
>>>>>
>>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote:
>>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote:
>>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior 
>>>>>>> wrote:
>>>>>>>> Please just use the queue_delayed_work() with a delay >0.
>>>>>>>>
>>>>>>>
>>>>>>> That doesn't work since queue_delayed_work() with a positive delay will
>>>>>>> still acquire the timer base lock, and we can have BPF programs run with
>>>>>>> the timer base lock held, i.e., calling call_srcu() while holding the
>>>>>>> timer base lock.
>>>>>>>
>>>>>>> irq_work on the other hand doesn't use any locking.
>>>>>>
>>>>>> Could we please restrict BPF somehow so it does not roam free? It is
>>>>>> absolutely awful to have irq_work() in call_srcu() just because it
>>>>>> might acquire locks.
>>>>>>
>>>>>
>>>>> I agree it's not RCU's fault ;-)
>>>>>
>>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call
>>>>> call_srcu() in irq_work instead? Or a more systematic defer mechanism
>>>>> that allows BPF to defer any lock-holding functions to a different
>>>>> context. (We have a similar issue where BPF cannot call kfree_rcu() in
>>>>> some cases, IIRC.)
>>>>>
>>>>> But we need to fix this in v7.0, so this short-term fix is still needed.
>>>>>
>>>>
>>>> I don't think this is an option, even longer term. We already do it
>>>> when it's incorrect to invoke call_rcu() or any other API in a
>>>> specific context (e.g., NMI, where we punt it using irq_work).
>>>> However, the case reported in this thread is different. It is an
>>>> existing user that worked fine before but is broken now. We were
>>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock
>>>> is held, so the underlying conversion to call_srcu() should
>>>> remain transparent in this respect.
>>>>
>>>
>>> I'm not sure that's a real argument here; the kernel doesn't have a stable
>>> internal API, which allows developers to refactor the code in a saner
>>> way. There are currently multiple issues suggesting we may need a
>>> defer mechanism in the BPF core, and if it makes the code easier to
>>> reason about, then why not? Think of it as a process by which we learn
>>> about all the defer patterns that BPF currently needs and wrap them in a
>>> nice and maintainable way.
>>
>> This is all right in theory, but I don't understand how your
>> theoretical deferral mechanism for BPF would help in the case
>> we're discussing, or why it is even appealing.
>>
>> How do we decide when to defer? Will we annotate all locks that can be
>> held by RCU internals so we can check whether they are held (on the
>> current CPU, which is non-trivial except by maintaining a held-lock
>> table; testing the locked bit is too conservative), and then defer
>> the call_srcu() from the caller in BPF? What if new locks are added? It
>> doesn't seem practical to me. Plus, it pushes the burden of detection
>> and deferral onto the caller, making everything more complicated and
>> error-prone.
>>
> 
> My suggestion would be: deferring all call_srcu()s that in BPF
> core. [...]

Isn't one of the issues that BPF is using call_rcu_tasks_trace(), which is now
internally using call_srcu()? So whether other parts of BPF use call_srcu() or
not, the issue still stands, AFAICS.

I think we have to fix RCU tasks trace, one way or the other.

Or did I miss something?

thanks,

--
Joel Fernandes
