This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel), please enter the following command in a terminal window:
apport-collect 1810998

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

** Changed in: linux (Ubuntu)
   Status: New => Incomplete

** Tags added: bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1810998

Title:
  CPU hard lockup with rigorous writes to NVMe drive

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [NOTE]

   * Patches will be sent to the kernel-team mailing list once the test
     kernel has been verified by the reporter.

  [Impact]

   * Users may experience CPU hard lockups when performing rigorous
     writes to NVMe drives.

   * The fix addresses a scheduling issue in the original implementation
     of wbt/writeback throttling.

   * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock contention and
     thundering herd issue in wbt_wait"), plus its fix commit
     38cfb5a45ee0 ("blk-wbt: improve waking of tasks").

   * There are additional commits to help with a cleaner backport and
     future maintenance:
     - Cosmic: 8 clean cherry picks.
     - Bionic: of the 13 commits, 9 are clean cherry picks and 4 are
       backports, which are just changes to context lines (i.e. a
       refresh) without any functional changes in the backport itself.

  [Test Case]

   * This command has been reported to reproduce the problem:

     $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

   * It generates stack traces as included below.

  [Regression Potential]

   * The commits have been verified for fixes in later commits in
     linux-next as of 2019-01-08, and all known fix commits are in.
   * The regression potential is mostly contained in the writeback
     throttling code (block/blk-wbt.*): 9 of the 13 patches change
     exclusively that, and the remaining 4 (2 of which touch sysfs) are:
     - blk-rq-qos: refactor out common elements of blk-wbt (block/)
     - block: Protect less code with sysfs_lock in
       blk_{un,}register_queue() (blk-sysfs.c)
     - block: Protect less code with sysfs_lock in
       blk_{un,}register_queue() (blk-{mq-}sysfs.c)
     - block: pass struct request instead of struct blk_issue_stat to
       wbt (block/, still mostly blk-wbt.*)

  [Other Info]

   * Alternatively, it would probably be possible to introduce just the
     two fix commits, with some changes to their code in the backport;
     but since the 'blk-rq-qos: refactor ...' commit may become a
     dependency for many additional/future fixes, it seemed worthwhile
     to pull it in earlier in the 18.04 branch.

   * The problem was introduced with the blk-wbt mechanism in v4.10-rc1,
     and the fix commits landed in v4.19-rc1 and -rc2, so only Bionic
     and Cosmic need this.

  [Stack Traces]

  [  393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
  ...
  [  393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
  ...
  [  393.628720] Call Trace:
  [  393.628721]  <IRQ>
  [  393.628724]  enqueue_task_fair+0x6c/0x7f0
  [  393.628726]  ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [  393.628728]  ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [  393.628731]  activate_task+0x57/0xc0
  [  393.628735]  ? sched_clock+0x9/0x10
  [  393.628736]  ? sched_clock+0x9/0x10
  [  393.628738]  ttwu_do_activate+0x49/0x90
  [  393.628739]  try_to_wake_up+0x1df/0x490
  [  393.628741]  default_wake_function+0x12/0x20
  [  393.628743]  autoremove_wake_function+0x12/0x40
  [  393.628744]  __wake_up_common+0x73/0x130
  [  393.628745]  __wake_up_common_lock+0x80/0xc0
  [  393.628746]  __wake_up+0x13/0x20
  [  393.628749]  __wbt_done.part.21+0xa4/0xb0
  [  393.628749]  wbt_done+0x72/0xa0
  [  393.628753]  blk_mq_free_request+0xca/0x1a0
  [  393.628755]  blk_mq_end_request+0x48/0x90
  [  393.628760]  nvme_complete_rq+0x23/0x120 [nvme_core]
  [  393.628763]  nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [  393.628764]  __blk_mq_complete_request+0xd2/0x140
  [  393.628766]  blk_mq_complete_request+0x18/0x20
  [  393.628767]  nvme_process_cq+0xe1/0x1b0 [nvme]
  [  393.628768]  nvme_irq+0x23/0x50 [nvme]
  [  393.628772]  __handle_irq_event_percpu+0x44/0x1a0
  [  393.628773]  handle_irq_event_percpu+0x32/0x80
  [  393.628774]  handle_irq_event+0x3b/0x60
  [  393.628778]  handle_edge_irq+0x7c/0x190
  [  393.628779]  handle_irq+0x20/0x30
  [  393.628783]  do_IRQ+0x46/0xd0
  [  393.628784]  common_interrupt+0x84/0x84
  [  393.628785]  </IRQ>
  ...
  [  393.628794]  ? cpuidle_enter_state+0x97/0x2f0
  [  393.628796]  cpuidle_enter+0x17/0x20
  [  393.628797]  call_cpuidle+0x23/0x40
  [  393.628798]  do_idle+0x18c/0x1f0
  [  393.628799]  cpu_startup_entry+0x73/0x80
  [  393.628802]  start_secondary+0x1a6/0x200
  [  393.628804]  secondary_startup_64+0xa5/0xb0
  [  393.628805] Code: ...
  [  405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled
  [  435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
  [  435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
  [  435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
  [  435.617025] Sending NMI from CPU 8 to CPUs 30:
  [  435.617029] NMI backtrace for cpu 30
  [  435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
  ...
  [  435.617047] Call Trace:
  [  435.617048]  <IRQ>
  [  435.617051]  enqueue_entity+0x9f/0x6b0
  [  435.617053]  enqueue_task_fair+0x6c/0x7f0
  [  435.617056]  activate_task+0x57/0xc0
  [  435.617059]  ? sched_clock+0x9/0x10
  [  435.617060]  ? sched_clock+0x9/0x10
  [  435.617061]  ttwu_do_activate+0x49/0x90
  [  435.617063]  try_to_wake_up+0x1df/0x490
  [  435.617065]  default_wake_function+0x12/0x20
  [  435.617067]  autoremove_wake_function+0x12/0x40
  [  435.617068]  __wake_up_common+0x73/0x130
  [  435.617069]  __wake_up_common_lock+0x80/0xc0
  [  435.617070]  __wake_up+0x13/0x20
  [  435.617073]  __wbt_done.part.21+0xa4/0xb0
  [  435.617074]  wbt_done+0x72/0xa0
  [  435.617077]  blk_mq_free_request+0xca/0x1a0
  [  435.617079]  blk_mq_end_request+0x48/0x90
  [  435.617084]  nvme_complete_rq+0x23/0x120 [nvme_core]
  [  435.617087]  nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [  435.617088]  __blk_mq_complete_request+0xd2/0x140
  [  435.617090]  blk_mq_complete_request+0x18/0x20
  [  435.617091]  nvme_process_cq+0xe1/0x1b0 [nvme]
  [  435.617093]  nvme_irq+0x23/0x50 [nvme]
  [  435.617096]  __handle_irq_event_percpu+0x44/0x1a0
  [  435.617097]  handle_irq_event_percpu+0x32/0x80
  [  435.617098]  handle_irq_event+0x3b/0x60
  [  435.617101]  handle_edge_irq+0x7c/0x190
  [  435.617102]  handle_irq+0x20/0x30
  [  435.617106]  do_IRQ+0x46/0xd0
  [  435.617107]  common_interrupt+0x84/0x84
  [  435.617108]  </IRQ>
  ...
  [  435.617117]  ? cpuidle_enter_state+0x97/0x2f0
  [  435.617118]  cpuidle_enter+0x17/0x20
  [  435.617119]  call_cpuidle+0x23/0x40
  [  435.617121]  do_idle+0x18c/0x1f0
  [  435.617122]  cpu_startup_entry+0x73/0x80
  [  435.617125]  start_secondary+0x1a6/0x200
  [  435.617127]  secondary_startup_64+0xa5/0xb0
  [  435.617128] Code: ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810998/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp