On 7/7/2025 10:04 AM, Paul E. McKenney wrote:
> On Mon, Jul 07, 2025 at 01:26:56PM +0000, [email protected] wrote:
>>>
>>> Signed-off-by: Joel Fernandes <[email protected]>
>>> ---
>>>  kernel/rcu/tree.h        | 11 ++++++++++-
>>>  kernel/rcu/tree_plugin.h | 29 ++++++++++++++++++++++-------
>>>  2 files changed, 32 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>>> index 3830c19cf2f6..f8f612269e6e 100644
>>> --- a/kernel/rcu/tree.h
>>> +++ b/kernel/rcu/tree.h
>>> @@ -174,6 +174,15 @@ struct rcu_snap_record {
>>>  	unsigned long jiffies;		/* Track jiffies value */
>>>  };
>>>  
>>> +/*
>>> + * The IRQ work (deferred_qs_iw) is used by RCU to get the scheduler's attention.
>>> + * It can be in one of the following states:
>>> + * - DEFER_QS_IDLE: An IRQ work was never scheduled.
>>> + * - DEFER_QS_PENDING: An IRQ work was scheduled but never run.
>>> + */
>>> +#define DEFER_QS_IDLE		0
>>> +#define DEFER_QS_PENDING	1
>>> +
>>>  /* Per-CPU data for read-copy update. */
>>>  struct rcu_data {
>>>  	/* 1) quiescent-state and grace-period handling : */
>>> @@ -192,7 +201,7 @@ struct rcu_data {
>>>  					/* during and after the last grace */
>>>  					/* period it is aware of. */
>>>  	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
>>> -	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
>>> +	int defer_qs_iw_pending;	/* Scheduler attention pending? */
>>>  	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
>>>  
>>>  	/* 2) batch handling */
>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>> index dd1c156c1759..baf57745b42f 100644
>>> --- a/kernel/rcu/tree_plugin.h
>>> +++ b/kernel/rcu/tree_plugin.h
>>> @@ -486,13 +486,16 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>  	struct rcu_node *rnp;
>>>  	union rcu_special special;
>>>  
>>> +	rdp = this_cpu_ptr(&rcu_data);
>>> +	if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
>>> +		rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
>>> +
>>>  	/*
>>>  	 * If RCU core is waiting for this CPU to exit its critical section,
>>>  	 * report the fact that it has exited. Because irqs are disabled,
>>>  	 * t->rcu_read_unlock_special cannot change.
>>>  	 */
>>>  	special = t->rcu_read_unlock_special;
>>> -	rdp = this_cpu_ptr(&rcu_data);
>>>  	if (!special.s && !rdp->cpu_no_qs.b.exp) {
>>>  		local_irq_restore(flags);
>>>  		return;
>>> @@ -623,12 +626,24 @@ notrace void rcu_preempt_deferred_qs(struct task_struct *t)
>>>   */
>>>  static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>>  {
>>> -	unsigned long flags;
>>> -	struct rcu_data *rdp;
>>> +	volatile unsigned long flags;
>>> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>  
>>> -	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
>>>  	local_irq_save(flags);
>>> -	rdp->defer_qs_iw_pending = false;
>>> +
>>> +	/*
>>> +	 * Requeue the IRQ work on the next unlock in the following situation:
>>> +	 * 1. rcu_read_unlock() queues IRQ work (state -> DEFER_QS_PENDING)
>>> +	 * 2. CPU enters new rcu_read_lock()
>>> +	 * 3. IRQ work runs but cannot report QS due to rcu_preempt_depth() > 0
>>> +	 * 4. rcu_read_unlock() does not re-queue work (state still PENDING)
>>> +	 * 5. Deferred QS reporting does not happen.
>>> +	 */
>>> +	if (rcu_preempt_depth() > 0) {
>>
>>
>> For PREEMPT_RT kernels, rcu_preempt_deferred_qs_handler() is invoked in
>> per-CPU irq_work kthreads, so the return value of rcu_preempt_depth() may
>> always be 0. Should we use IRQ_WORK_INIT_HARD() to initialize defer_qs_iw?
> 
> It sure does look like we need "||" rather than "&&" here in
> rcu_read_unlock_special():
> 
>       if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
>           IS_ENABLED(CONFIG_PREEMPT_RT))
>               rdp->defer_qs_iw = IRQ_WORK_INIT_HARD(
>                                       rcu_preempt_deferred_qs_handler);
>       else
>               init_irq_work(&rdp->defer_qs_iw,
>                             rcu_preempt_deferred_qs_handler);
> 
Yes, or could we always IRQ_WORK_INIT_HARD()?

For more context, git blame says Z had looked at this before and added
IRQ_WORK_INIT_HARD:

commit f596e2ce1c0f250bb3ecc179f611be37e862635f
Author: Zqiang <[email protected]>
Date:   Mon Apr 4 07:59:32 2022 +0800

    rcu: Use IRQ_WORK_INIT_HARD() to avoid rcu_read_unlock() hangs

    When booting kernels built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
    and CONFIG_PREEMPT_RT=y, the rcu_read_unlock_special() function's
    invocation of irq_work_queue_on() the init_irq_work() causes the
    rcu_preempt_deferred_qs_handler() function to work execute in SCHED_FIFO
    irq_work kthreads.  Because rcu_read_unlock_special() is invoked on each
    rcu_read_unlock() in such kernels, the amount of work just keeps piling
    up, resulting in a boot-time hang.

    This commit therefore avoids this hang by using IRQ_WORK_INIT_HARD()
    instead of init_irq_work(), but only in kernels built with both
    CONFIG_PREEMPT_RT=y and CONFIG_RCU_STRICT_GRACE_PERIOD=y.

    Signed-off-by: Zqiang <[email protected]>
    Signed-off-by: Paul E. McKenney <[email protected]>

Btw, once we conclude discussion of how to handle it, if Z could send a patch
based on my rcu/next branch [1], I could apply it to post for further
testing/review.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/next

thanks,

 - Joel

