On 7/7/2025 10:04 AM, Paul E. McKenney wrote:
> On Mon, Jul 07, 2025 at 01:26:56PM +0000, [email protected] wrote:
>>>
>>> Signed-off-by: Joel Fernandes <[email protected]>
>>>
>>> ---
>>>  kernel/rcu/tree.h        | 11 ++++++++++-
>>>  kernel/rcu/tree_plugin.h | 29 ++++++++++++++++++++++-------
>>>  2 files changed, 32 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>>> index 3830c19cf2f6..f8f612269e6e 100644
>>> --- a/kernel/rcu/tree.h
>>> +++ b/kernel/rcu/tree.h
>>> @@ -174,6 +174,15 @@ struct rcu_snap_record {
>>>  	unsigned long jiffies;		/* Track jiffies value */
>>>  };
>>>
>>> +/*
>>> + * The IRQ work (deferred_qs_iw) is used by RCU to get the scheduler's
>>> + * attention. It can be in one of the following states:
>>> + * - DEFER_QS_IDLE: An IRQ work was never scheduled.
>>> + * - DEFER_QS_PENDING: An IRQ work was scheduled but never run.
>>> + */
>>> +#define DEFER_QS_IDLE 0
>>> +#define DEFER_QS_PENDING 1
>>> +
>>>  /* Per-CPU data for read-copy update. */
>>>  struct rcu_data {
>>>  	/* 1) quiescent-state and grace-period handling : */
>>> @@ -192,7 +201,7 @@ struct rcu_data {
>>>  					/* during and after the last grace */
>>>  					/* period it is aware of. */
>>>  	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
>>> -	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
>>> +	int defer_qs_iw_pending;	/* Scheduler attention pending? */
>>>  	struct work_struct strict_work;	/* Schedule readers for strict GPs. */
>>>
>>>  	/* 2) batch handling */
>>>
>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>> index dd1c156c1759..baf57745b42f 100644
>>> --- a/kernel/rcu/tree_plugin.h
>>> +++ b/kernel/rcu/tree_plugin.h
>>> @@ -486,13 +486,16 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>  	struct rcu_node *rnp;
>>>  	union rcu_special special;
>>>
>>> +	rdp = this_cpu_ptr(&rcu_data);
>>> +	if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING)
>>> +		rdp->defer_qs_iw_pending = DEFER_QS_IDLE;
>>> +
>>>  	/*
>>>  	 * If RCU core is waiting for this CPU to exit its critical section,
>>>  	 * report the fact that it has exited. Because irqs are disabled,
>>>  	 * t->rcu_read_unlock_special cannot change.
>>>  	 */
>>>  	special = t->rcu_read_unlock_special;
>>> -	rdp = this_cpu_ptr(&rcu_data);
>>>  	if (!special.s && !rdp->cpu_no_qs.b.exp) {
>>>  		local_irq_restore(flags);
>>>  		return;
>>> @@ -623,12 +626,24 @@ notrace void rcu_preempt_deferred_qs(struct task_struct *t)
>>>   */
>>>  static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>>  {
>>> -	unsigned long flags;
>>> -	struct rcu_data *rdp;
>>> +	volatile unsigned long flags;
>>> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>
>>> -	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
>>>  	local_irq_save(flags);
>>> -	rdp->defer_qs_iw_pending = false;
>>> +
>>> +	/*
>>> +	 * Requeue the IRQ work on the next unlock in the following situation:
>>> +	 * 1. rcu_read_unlock() queues IRQ work (state -> DEFER_QS_PENDING)
>>> +	 * 2. CPU enters new rcu_read_lock()
>>> +	 * 3. IRQ work runs but cannot report QS due to rcu_preempt_depth() > 0
>>> +	 * 4. rcu_read_unlock() does not re-queue work (state still PENDING)
>>> +	 * 5. Deferred QS reporting does not happen.
>>> +	 */
>>> +	if (rcu_preempt_depth() > 0) {
>>
>>
>> For PREEMPT_RT kernels, rcu_preempt_deferred_qs_handler() is invoked in
>> per-CPU irq_work kthreads, so the return value of rcu_preempt_depth()
>> may always be 0. Should we use IRQ_WORK_INIT_HARD() to initialize
>> defer_qs_iw?
>
> It sure does look like we need "||" rather than "&&" here in
> rcu_read_unlock_special():
>
> 	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
> 	    IS_ENABLED(CONFIG_PREEMPT_RT))
> 		rdp->defer_qs_iw =
> 			IRQ_WORK_INIT_HARD(rcu_preempt_deferred_qs_handler);
> 	else
> 		init_irq_work(&rdp->defer_qs_iw,
> 			      rcu_preempt_deferred_qs_handler);
>
Yes, or could we always use IRQ_WORK_INIT_HARD()?
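
Something like this in rcu_read_unlock_special(), as an untested sketch just
to make the "always hard" option concrete (it assumes the DEFER_QS_PENDING
state check introduced by this patch):

	if (rdp->defer_qs_iw_pending != DEFER_QS_PENDING &&
	    cpu_online(rdp->cpu)) {
		/*
		 * Unconditionally use hard irq_work so the handler always
		 * runs from hard interrupt context, keeping
		 * rcu_preempt_depth() meaningful even on PREEMPT_RT.
		 */
		rdp->defer_qs_iw =
			IRQ_WORK_INIT_HARD(rcu_preempt_deferred_qs_handler);
		rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
		irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
	}

That would also let us drop the IS_ENABLED() conditional entirely.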
For more context, git blame says Z had looked at this before and added
IRQ_WORK_INIT_HARD:
commit f596e2ce1c0f250bb3ecc179f611be37e862635f
Author: Zqiang <[email protected]>
Date:   Mon Apr 4 07:59:32 2022 +0800

    rcu: Use IRQ_WORK_INIT_HARD() to avoid rcu_read_unlock() hangs

    When booting kernels built with both CONFIG_RCU_STRICT_GRACE_PERIOD=y
    and CONFIG_PREEMPT_RT=y, the rcu_read_unlock_special() function's
    invocation of irq_work_queue_on() on the init_irq_work()-initialized
    defer_qs_iw causes the rcu_preempt_deferred_qs_handler() function to
    execute in SCHED_FIFO irq_work kthreads. Because
    rcu_read_unlock_special() is invoked on each rcu_read_unlock() in
    such kernels, the amount of work just keeps piling up, resulting in
    a boot-time hang.

    This commit therefore avoids this hang by using IRQ_WORK_INIT_HARD()
    instead of init_irq_work(), but only in kernels built with both
    CONFIG_PREEMPT_RT=y and CONFIG_RCU_STRICT_GRACE_PERIOD=y.

    Signed-off-by: Zqiang <[email protected]>
    Signed-off-by: Paul E. McKenney <[email protected]>
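
For reference, the only difference between the two initializers is the
IRQ_WORK_HARD_IRQ flag (paraphrased from include/linux/irq_work.h):

	#define IRQ_WORK_INIT(_func)      __IRQ_WORK_INIT(_func, 0)
	#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)

A "hard" irq_work runs from hard interrupt context even on PREEMPT_RT,
whereas plain init_irq_work() work gets punted to the per-CPU SCHED_FIFO
irq_work kthreads there, which is what caused the pileup described above.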
Btw, once we conclude the discussion on how to handle it, if Z could send a
patch based on my rcu/next branch [1], I could apply it and post it for
further testing/review.
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/next
thanks,
- Joel