Hi David, hi Kunwu,
Thanks a lot for the suggestions.
I reran the reproducer with
CONFIG_PROVE_LOCKING=y, CONFIG_LOCKDEP=y, CONFIG_DEBUG_LOCK_ALLOC=y enabled.
Based on the lockdep-enabled run, here is what I can clarify:
Lockdep does not report any lock inversion, recursive locking, or
circular dependency.
The samples showing __mod_zone_page_state() do not appear to indicate
a blocking point; that frame seems to be simply where the task happened
to be when it was sampled.
From the timeline of the reports, the earliest problematic behavior
shows up before the MM/LRU-heavy paths come into play.
In the first hung-task report, multiple repro1 threads are already blocked in:
down_write()
└─ rwbase_write_lock()
└─ __rt_mutex_slowlock_locked()
└─ rt_mutex_schedule()
via the do_vfs_ioctl() → perf_fasync() path, and are in D state for
more than 143 seconds at that point.
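For reference, the down_write() in these stacks is the inode lock taken
inside perf_fasync(); a rough sketch of the pattern (paraphrased from my
reading of kernel/events/core.c, not verbatim source):

#include <linux/fs.h>
#include <linux/perf_event.h>

/* Sketch of the locking pattern behind the D-state stacks above. */
static int perf_fasync_sketch(int fd, struct file *filp, int on)
{
	struct inode *inode = file_inode(filp);
	struct perf_event *event = filp->private_data;
	int retval;

	/*
	 * inode_lock() is down_write(&inode->i_rwsem).  On PREEMPT_RT the
	 * rwsem is rwbase/rt_mutex based, so a contended writer goes
	 * rwbase_write_lock() -> __rt_mutex_slowlock_locked()
	 * -> rt_mutex_schedule(), matching the traces above.
	 */
	inode_lock(inode);
	retval = fasync_helper(fd, filp, on, &event->fasync);
	inode_unlock(inode);

	return retval < 0 ? retval : 0;
}

So all repro1 threads issuing the fasync ioctl against the same inode
serialize on that single i_rwsem.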
After several threads are stuck there, the system degrades further:
other threads remain in R state, spending long, uninterrupted stretches
in MM allocation / LRU paths
(alloc_pages(), get_page_from_freelist(), __handle_mm_fault())
without hitting reschedule points.
This then leads to RCU preempt stalls, and eventually workqueue lockups
(e.g. vmstat_shepherd, do_cache_clean, wb_workfn).
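For what it's worth, the "not hitting reschedule points" part can be
sanity-checked from userspace by sampling the R-state threads'
context-switch counters; a minimal sketch (the pid below is just a
placeholder):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder pid of one of the R-state repro1 threads. */
	const char *path = "/proc/2040/status";
	char line[256];

	/* If voluntary/nonvoluntary_ctxt_switches barely move while the
	 * thread keeps burning CPU, it really is not rescheduling. */
	for (int i = 0; i < 10; i++) {
		FILE *f = fopen(path, "r");
		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "ctxt_switches"))
				printf("%s", line);
		fclose(f);
		sleep(1);
	}
	return 0;
}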
Lockdep’s “show all locks held” output does not show the blocked repro1
threads holding any MM/LRU/zone locks themselves; the only lock listed
for them at that point is the inode lock (&sb->s_type->i_mutex_key#17),
which suggests the contended RT rwsem is held elsewhere.
Overall, this currently looks less like a single blocking bug in
__mod_zone_page_state() and more like a PREEMPT_RT-specific starvation
scenario, where long-held RT rwsems in the ioctl/perf path combine with
long stretches of CPU-bound MM/LRU execution and amplify into RCU
starvation and workqueue lockups.
Below is the earliest hung-task report from the lockdep-enabled run
for reference:
[386.499937] INFO: task repro1:2066 blocked for more than 143 seconds.
[386.499956] Not tainted 6.19.0-rc7 #4
[386.499964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[386.499970] task:repro1 state:D stack:28400 pid:2066 tgid:2066 ppid:293 2
[386.500022] Call Trace:
[386.500027] <TASK>
[386.500037] __schedule+0x1198/0x3f00
[386.500069] ? io_schedule_timeout+0x80/0x80
[386.500088] ? kvm_sched_clock_read+0x16/0x20
[386.500111] ? local_clock_noinstr+0xf/0xc0
[386.500125] ? __rt_mutex_slowlock_locked.constprop.0+0xecd/0x30c0
[386.500148] rt_mutex_schedule+0x9f/0xe0
[386.500171] __rt_mutex_slowlock_locked.constprop.0+0xedc/0x30c0
[386.500197] ? down_write_trylock+0x1a0/0x1a0
[386.500222] ? lock_acquired+0xbd/0x340
[386.500245] rwbase_write_lock+0x744/0xa80
[386.500266] ? perf_fasync+0xc0/0x130
[386.500284] ? rt_mutex_adjust_prio_chain.isra.0+0x3240/0x3240
[386.500304] ? kvm_sched_clock_read+0x16/0x20
[386.500329] ? perf_fasync+0xc0/0x130
[386.500344] ? local_clock+0x10/0x20
[386.500364] ? lock_contended+0x189/0x420
[386.500385] down_write+0x6e/0x1e0
[386.500405] perf_fasync+0xc0/0x130
[386.500421] ? perf_cgroup_css_free+0x50/0x50
[386.500440] do_vfs_ioctl+0x9b9/0x1480
[386.500457] ? lock_vma_under_rcu+0x7ee/0xd90
[386.500475] ? ioctl_file_clone+0xf0/0xf0
[386.500490] ? lock_is_held_type+0xa0/0x110
[386.500506] ? handle_mm_fault+0x5a6/0x9d0
[386.500526] ? kvm_sched_clock_read+0x16/0x20
[386.502053] ? local_clock_noinstr+0xf/0xc0
[386.502073] ? handle_mm_fault+0x5a6/0x9d0
[386.502092] ? exc_page_fault+0xb0/0x180
[386.502106] ? kvm_sched_clock_read+0x16/0x20
[386.502129] ? local_clock_noinstr+0xf/0xc0
[386.502142] ? exc_page_fault+0xb0/0x180
[386.502154] ? local_clock+0x10/0x20
[386.502174] ? lock_release+0x258/0x3c0
[386.502196] ? irqentry_exit+0xf0/0x6d0
[386.502213] __x64_sys_ioctl+0x112/0x220
[386.502232] do_syscall_64+0xc3/0x430
[386.502253] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[386.502269] RIP: 0033:0x7f62f7922fc9
[386.502351] </TASK>
[386.502357] INFO: task repro1:2072 blocked for more than 143 seconds.
[386.502366] Not tainted 6.19.0-rc7 #4
[386.502373] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[386.502378] task:repro1 state:D stack:28400 pid:2072 tgid:2072 ppid:294 2
[386.502427] Call Trace:
[386.502431] <TASK>
[386.502439] __schedule+0x1198/0x3f00
[386.502463] ? io_schedule_timeout+0x80/0x80
[386.502483] ? mark_held_locks+0x50/0x80
[386.502505] rt_mutex_schedule+0x9f/0xe0
[386.502527] __rt_mutex_slowlock_locked.constprop.0+0xedc/0x30c0
[386.503218] ? down_write_trylock+0x1a0/0x1a0
[386.503246] ? lock_acquired+0xbd/0x340
[386.503269] rwbase_write_lock+0x744/0xa80
[386.503290] ? perf_fasync+0xc0/0x130
[386.503306] ? rt_mutex_adjust_prio_chain.isra.0+0x3240/0x3240
[386.503327] ? kvm_sched_clock_read+0x16/0x20
[386.503351] ? perf_fasync+0xc0/0x130
[386.503366] ? local_clock+0x10/0x20
[386.503386] ? lock_contended+0x189/0x420
[386.503407] down_write+0x6e/0x1e0
[386.503427] perf_fasync+0xc0/0x130
[386.503442] ? perf_cgroup_css_free+0x50/0x50
[386.503461] do_vfs_ioctl+0x9b9/0x1480
[386.503476] ? lock_vma_under_rcu+0x7ee/0xd90
[386.503493] ? ioctl_file_clone+0xf0/0xf0
[386.503508] ? lock_is_held_type+0xa0/0x110
[386.503524] ? handle_mm_fault+0x5a6/0x9d0
[386.503543] ? kvm_sched_clock_read+0x16/0x20
[386.504012] ? local_clock_noinstr+0xf/0xc0
[386.504049] ? exc_page_fault+0xb0/0x180
[386.504312] ? irqentry_exit+0xf0/0x6d0
[386.504330] __x64_sys_ioctl+0x112/0x220
[386.504369] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[386.504464] </TASK>
[386.504470] INFO: task repro1:2073 blocked for more than 143 seconds.
[386.504491] task:repro1 state:D stack:28400 pid:2073 tgid:2073 ppid:292 2
[386.504540] Call Trace:
[386.504544] <TASK>
[386.505300] __schedule+0x1198/0x3f00
[386.505347] ? mark_held_locks+0x50/0x80
[386.505369] rt_mutex_schedule+0x9f/0xe0
[386.505391] __rt_mutex_slowlock_locked.constprop.0+0xedc/0x30c0
[386.505464] rwbase_write_lock+0x744/0xa80
[386.505988] down_write+0x6e/0x1e0
[386.506042] do_vfs_ioctl+0x9b9/0x1480
[386.506301] __x64_sys_ioctl+0x112/0x220
[386.506340] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[386.506434] </TASK>
[386.506442] Showing all locks held in the system:
[386.506447] 4 locks held by pr/legacy/16:
[386.506456] 1 lock held by khungtaskd/37:
[386.506464] #0: ffffffff85041540 (rcu_read_lock){....}-{1:3}
[386.506503] 1 lock held by in:imklog/196:
[386.506513] 1 lock held by repro1/2040:
[386.506522] 1 lock held by repro1/2066:
[386.506532] #0: ffff88800784bc50 (&sb->s_type->i_mutex_key#17)
[386.507276] 1 lock held by repro1/2072:
[386.507284] #0: ffff88800784bc50 (&sb->s_type->i_mutex_key#17)
[386.507321] 1 lock held by repro1/2073:
[386.507328] #0: ffff88800784bc50 (&sb->s_type->i_mutex_key#17)
[427.459692] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 40s!
[427.459779] workqueue events:
[427.459809] pending: vmstat_shepherd, e1000_watchdog
[427.460020] workqueue events_freezable_pwr_efficient:
[427.460020] in-flight: disk_events_workfn
[427.460052] workqueue writeback:
[427.460084] in-flight: wb_workfn
[427.460231] Showing backtraces of running workers in stalled CPU-bound worker pools
Thanks
Zw Tang