> On Mar 18, 2026, at 9:13 PM, Boqun Feng <[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 12:26:38AM +0000, Zqiang wrote:
>>>
>>>> On Wed, 18 Mar 2026 at 23:15, Boqun Feng <[email protected]> wrote:
>>>
>>>>
>>>> On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote:
>>>> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote:
>>>> [...]
>>>>>> Ah, so it is an ABBA deadlock, not an ABA self-deadlock. I guess this
>>>>>> is a different issue from the NMI issue? It is more of an issue of
>>>>>> calling the call_srcu() API with scheduler locks held.
>>>>>>
>>>>>> Something like below, I think:
>>>>>>
>>>>>> CPU A (BPF tracepoint)                        CPU B (concurrent call_srcu)
>>>>>> ----------------------------                  ------------------------------------
>>>>>> [1] holds &rq->__lock
>>>>>>                                               [2] -> call_srcu
>>>>>>                                                    -> srcu_gp_start_if_needed
>>>>>>                                                     -> srcu_funnel_gp_start
>>>>>>                                                      -> spin_lock_irqsave_ssp_contention()
>>>>>>                                                       -> holds srcu locks
>>>>>>
>>>>>> [4] calls call_rcu_tasks_trace()              [5] srcu_funnel_gp_start (cont..)
>>>>>>      -> call_srcu()                                -> queue_delayed_work
>>>>>>       -> srcu_gp_start_if_needed()                  -> __queue_work()
>>>>>>        -> srcu_funnel_gp_start()                     -> wake_up_worker()
>>>>>>         -> spin_lock_irqsave_ssp_contention()         -> try_to_wake_up()
>>>>>>          -> WANTS srcu locks                      [6] WANTS rq->__lock
>>>>>
>>>>> I see, we can also have a self-deadlock even without CPU B, when CPU A
>>>>> is going to try_to_wake_up() a worker on the same CPU.
>>>>>
>>>>> An interesting observation is that the deadlock can be avoided if
>>>>> queue_delayed_work() uses a non-zero delay, which means a timer will be
>>>>> armed instead of acquiring the rq lock.
>>>>>
>>>>
>>>> If my observation is correct, then this can probably fix the deadlock
>>>> issue with the runqueue lock (untested though), but it won't work if a
>>>> BPF tracepoint can fire with a timer base lock held.
>>>>
>>> Unfortunately it can, there is at least one tracepoint that is
>>> invoked with the hrtimer base lock held.
>>> Alexei ended up fixing this in the recent past [0]. So I think this
>>> would cause trouble too.
>>>
>>> hrtimer_start_range_ns() -> __hrtimer_start_range_ns() ->
>>> remove_timer() -> __remove_hrtimer() -> debug_deactivate() ->
>>> trace_hrtimer_cancel().
>>>
>>> BPF can attach to such a tracepoint.
>>
>> Is it possible to use irq_work_queue() to trigger queue_delayed_work()
>> by checking if ssp == &rcu_tasks_trace_srcu_struct ?
>>
>
> Good call! I didn't do the exact == check, but close:
>
> https://lore.kernel.org/rcu/[email protected]/
>
> ;-)
IMO I am OK with Boqun's patch, but I do not think the == check should be
used. SRCU internals should not special-case specific SRCU users; that is
a hack :)

Thanks!

>
> Regards,
> Boqun
>
>> Thanks
>> Zqiang
>>
>>
>>>
>>> [0]:
>>> https://lore.kernel.org/bpf/[email protected]
>>>
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>>> ------>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index 2328827f8775..a5d67264acb5 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>>>>  	struct srcu_node *snp_leaf;
>>>>  	unsigned long snp_seq;
>>>>  	struct srcu_usage *sup = ssp->srcu_sup;
>>>> +	bool irqs_were_disabled;
>>>>
>>>>  	/* Ensure that snp node tree is fully initialized before traversing it */
>>>>  	if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
>>>> @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>>>>
>>>>  	/* Top of tree, must ensure the grace period will be started. */
>>>>  	raw_spin_lock_irqsave_ssp_contention(ssp, &flags);
>>>> +	irqs_were_disabled = irqs_disabled_flags(flags);
>>>>  	if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) {
>>>>  		/*
>>>>  		 * Record need for grace period s.  Pair with load
>>>> @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>>>>  		// it isn't.  And it does not have to be.  After all, it
>>>>  		// can only be executed during early boot when there is only
>>>>  		// the one boot CPU running with interrupts still disabled.
>>>> +		//
>>>> +		// If irqs were disabled when call_srcu() was called, then we
>>>> +		// could be in the scheduler path with a runqueue lock held.
>>>> +		// Delay the process_srcu() work one more jiffy so we don't go
>>>> +		// through the kick_pool() -> wake_up_process() path below,
>>>> +		// thus avoiding deadlock with the runqueue lock.
>>>>  		if (likely(srcu_init_done))
>>>>  			queue_delayed_work(rcu_gp_wq, &sup->work,
>>>> -					   !!srcu_get_delay(ssp));
>>>> +					   !!srcu_get_delay(ssp) +
>>>> +					   !!irqs_were_disabled);
>>>>  		else if (list_empty(&sup->work.work.entry))
>>>>  			list_add(&sup->work.work.entry, &srcu_boot_list);
>>>>  	}
>>>>
>>>
