On Sun, Dec 08, 2024 at 12:48:29AM +0800, MengEn Sun wrote:
> Hi,
>
> The kernel version that we are using is 5.4.241. Recently, we
> encountered a case of rcu_sched thread soft lock. The kernel stack
> is as follows:
>
> crash> bt
> PID: 10 TASK: ffff8a83cc610000 CPU: 14 COMMAND: "rcu_sched"
> #0 [ffffc90000a64d50] machine_kexec at ffffffff8105714c
> #1 [ffffc90000a64da0] __crash_kexec at ffffffff8112c9bd
> #2 [ffffc90000a64e68] panic at ffffffff819e29d8
> #3 [ffffc90000a64ee8] watchdog_timer_fn.cold.9 at ffffffff819e9a3f
> #4 [ffffc90000a64f18] __hrtimer_run_queues at ffffffff8110b5d0
> #5 [ffffc90000a64f78] hrtimer_interrupt at ffffffff8110bdf0
> #6 [ffffc90000a64fd8] smp_apic_timer_interrupt at ffffffff81c0268a
> #7 [ffffc90000a64ff0] apic_timer_interrupt at ffffffff81c01c0f
> --- <IRQ stack> ---
> #8 [ffffc9000006bdd8] apic_timer_interrupt at ffffffff81c01c0f
> [exception RIP: _raw_spin_unlock_irqrestore+14]
> #9 [ffffc9000006be80] force_qs_rnp at ffffffff810fa0b0
> #10 [ffffc9000006beb8] rcu_gp_kthread at ffffffff810fd9b6
> #11 [ffffc9000006bf10] kthread at ffffffff8109e40c
> #12 [ffffc9000006bf50] ret_from_fork at ffffffff81c00255
>
> It seems that this is because a CPU did not report its quiescent
> state (QS) in time. I can see from the vmcore that CPU 141 did not
> report its QS.
>
>
> The current process on CPU 141 is as follows; it looks like CPU 141
> has detected the RCU stall and is printing to the serial console:
>
> crash> bt -c 141
> PID: 594039 TASK: ffff8a699dab8000 CPU: 141 COMMAND: "stress-ng-zombi"
> #0 [fffffe0001f70e40] crash_nmi_callback at ffffffff8104b54f
> #1 [fffffe0001f70e60] nmi_handle at ffffffff8101cf92
> #2 [fffffe0001f70eb8] default_do_nmi at ffffffff8101d16e
> #3 [fffffe0001f70ed8] do_nmi at ffffffff8101d351
> #4 [fffffe0001f70ef0] end_repeat_nmi at ffffffff81c015f3
> [exception RIP: vprintk_emit+492]
> --- <NMI exception stack> ---
> #5 [ffffc90002038db0] vprintk_emit at ffffffff810eb19c
> #6 [ffffc90002038e00] printk at ffffffff819e6544
> #7 [ffffc90002038e60] rcu_check_gp_kthread_starvation at ffffffff819e78bd
> #8 [ffffc90002038e88] rcu_sched_clock_irq.cold.84 at ffffffff819e8021
> #9 [ffffc90002038ed0] update_process_times at ffffffff8110a8e4
> #10 [ffffc90002038ee0] tick_sched_handle at ffffffff8111be22
> #11 [ffffc90002038ef8] tick_sched_timer at ffffffff8111c147
> #12 [ffffc90002038f18] __hrtimer_run_queues at ffffffff8110b5d0
> #13 [ffffc90002038f78] hrtimer_interrupt at ffffffff8110bdf0
> #14 [ffffc90002038fd8] smp_apic_timer_interrupt at ffffffff81c0268a
> #15 [ffffc90002038ff0] apic_timer_interrupt at ffffffff81c01c0f
> --- <IRQ stack> ---
> #16 [ffffc90004227c98] apic_timer_interrupt at ffffffff81c01c0f
> [exception RIP: __rb_erase_color+34]
> #17 [ffffc90004227d78] unlink_file_vma at ffffffff81241e0b
> #18 [ffffc90004227da0] free_pgtables at ffffffff8123708e
> #19 [ffffc90004227dd8] exit_mmap at ffffffff81243ea1
> #20 [ffffc90004227e78] mmput at ffffffff81074fd4
> #21 [ffffc90004227e90] do_exit at ffffffff8107d50c
> #22 [ffffc90004227f08] do_group_exit at ffffffff8107de6a
> #23 [ffffc90004227f30] __x64_sys_exit_group at ffffffff8107dee4
> #24 [ffffc90004227f38] do_syscall_64 at ffffffff81002535
> #25 [ffffc90004227f50] entry_SYSCALL_64_after_hwframe at ffffffff81c000a4
>
> I think the reasons are as follows:
> - The rcu_sched kthread needs to acquire the lock of the rcu_node
>   that CPU 141 belongs to in order to force QS operations.
> - However, CPU 141 has detected the RCU stall and is holding that
>   rcu_node lock while printing to the serial console.
> - CPU 141 does touch the NMI watchdog, given that printing to the
>   serial console might otherwise trigger soft or hard lockups on
>   CPU 141 itself.
> - However, CPU 141 does not take other CPUs into account, for
>   example CPU 14, which may be blocked because CPU 141 holds the
>   rcu_node lock.
>
> Printing to the serial console is an expensive operation. Could we
> instead have CPU 141, while inside the rcu_node lock's critical
> section, only collect the information it needs to print, and then
> print the collected log after it has released the rcu_node lock?

Mainline uses printk_deferred_enter() and printk_deferred_exit() to get
this effect. Please see this commit in mainline:
1ecd9d68eb44 ("rcu: Defer printing stall-warning backtrace when holding
rcu_node lock")

And these might also be helpful:
cbe644aa6fe1 ("rcu: Stop stall warning from dumping stacks if grace period
ends")
9650edd9bf1d ("rcu: Finer-grained grace-period-end checks in
rcu_dump_cpu_stacks()")

> I'm not sure if there is a better way to solve this problem?

There might well be. For example, should printk_deferred_enter() and
printk_deferred_exit() prove problematic, it has been suggested that
printk_cpu_sync_get_irqsave() and printk_cpu_sync_put_irqrestore() be
used instead. However, the header comments for these two macros clearly
state that they are very strong medicine with serious side effects.

So: (1) There needs to be a very real problem with the current approach
of printk_deferred_enter() and printk_deferred_exit(), and (2) Careful
analysis, review, and testing would be required.

And there might well be some even better approach. I would of course
welcome additional eyes on this problem, and a good place to start is to
review the commits to this code.

							Thanx, Paul