This bug is missing log files that will aid in diagnosing the problem.
While running an Ubuntu kernel (not a mainline or third-party kernel)
please enter the following command in a terminal window:

apport-collect 1810998

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that fact and change
the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the
Ubuntu Kernel Team.

** Changed in: linux (Ubuntu)
       Status: New => Incomplete

** Tags added: bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1810998

Title:
  CPU hard lockup with rigorous writes to NVMe drive

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [NOTE]

   * Patches will be sent to the kernel-team mailing list
     once the test kernel has been verified by the reporter.

  [Impact]

   * Users may experience cpu hard lockups when performing
     rigorous writes to NVMe drives.

   * The fix addresses an scheduling issue in the original
     implementation of wbt/writeback throttling

   * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
     contention and thundering herd issue in wbt_wait"),
     plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
     waking of tasks").

   * There are additional commits to help with a cleaner
     backport and future maintenance:
     - Cosmic: 8 clean cherry picks.
     - Bionic: So, of 13 commits, 9 are clean cherry-picks
       and 4 backports, which are just changes to context
       lines (i.e. refresh) without any functional changes
       in the backport itself.

  [Test Case]

   * This command has been reported to reproduce the problem:

     $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t
  128

   * It generates stack traces as included below.

  [Regression Potential]

   * The commits have been verified for fixes in later commits in
     linux-next as of 2019-01-08 and all known fix commits are in.

   * The regression potential is mostly contained in the writeback
     throttling (block/blk-wbt.*), as almost all of the 13 patches
     change exclusively that, except for 4 of them (2 of which are
     sysfs):

     - blk-rq-qos: refactor out common elements of blk-wbt (block/)
     - block: Protect less code with sysfs_lock in blk_{un,}register_queue() 
(blk-sysfs.c)
     - block: Protect less code with sysfs_lock in blk_{un,}register_queue() 
(blk-{mq-}sysfs.c)
     - block: pass struct request instead of struct blk_issue_stat to wbt 
(block/, still mostly blk-wbt.*)

  [Other Info]

   * Alternatively, it is probably possible to introduce just the
     two commits that fix this with some changes to their code in
     the backport, but since the 'blk-rq-qos: refactor ..' commit
     may become a dependency for many additional/future fixes, it
     seemed interesting to pull it in earlier in the 18.04 branch.

   * The problem has been introduced with the blk-wbt mechanism,
     in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
     so only Bionic and Cosmic needs this.

  [Stack Traces]

  [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
  ...
  [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic 
#21-Ubuntu
  ...
  [ 393.628720] Call Trace:
  [ 393.628721] <IRQ>
  [ 393.628724] enqueue_task_fair+0x6c/0x7f0
  [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628731] activate_task+0x57/0xc0
  [ 393.628735] ? sched_clock+0x9/0x10
  [ 393.628736] ? sched_clock+0x9/0x10
  [ 393.628738] ttwu_do_activate+0x49/0x90
  [ 393.628739] try_to_wake_up+0x1df/0x490
  [ 393.628741] default_wake_function+0x12/0x20
  [ 393.628743] autoremove_wake_function+0x12/0x40
  [ 393.628744] __wake_up_common+0x73/0x130
  [ 393.628745] __wake_up_common_lock+0x80/0xc0
  [ 393.628746] __wake_up+0x13/0x20
  [ 393.628749] __wbt_done.part.21+0xa4/0xb0
  [ 393.628749] wbt_done+0x72/0xa0
  [ 393.628753] blk_mq_free_request+0xca/0x1a0
  [ 393.628755] blk_mq_end_request+0x48/0x90
  [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 393.628764] __blk_mq_complete_request+0xd2/0x140
  [ 393.628766] blk_mq_complete_request+0x18/0x20
  [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 393.628768] nvme_irq+0x23/0x50 [nvme]
  [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
  [ 393.628773] handle_irq_event_percpu+0x32/0x80
  [ 393.628774] handle_irq_event+0x3b/0x60
  [ 393.628778] handle_edge_irq+0x7c/0x190
  [ 393.628779] handle_irq+0x20/0x30
  [ 393.628783] do_IRQ+0x46/0xd0
  [ 393.628784] common_interrupt+0x84/0x84
  [ 393.628785] </IRQ>
  ...
  [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
  [ 393.628796] cpuidle_enter+0x17/0x20
  [ 393.628797] call_cpuidle+0x23/0x40
  [ 393.628798] do_idle+0x18c/0x1f0
  [ 393.628799] cpu_startup_entry+0x73/0x80
  [ 393.628802] start_secondary+0x1a6/0x200
  [ 393.628804] secondary_startup_64+0xa5/0xb0
  [ 393.628805] Code: ...

  [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

  [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
  [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
  [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
  [ 435.617029] NMI backtrace for cpu 30
  [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic 
#21-Ubuntu
  ...
  [ 435.617047] Call Trace:
  [ 435.617048] <IRQ>
  [ 435.617051] enqueue_entity+0x9f/0x6b0
  [ 435.617053] enqueue_task_fair+0x6c/0x7f0
  [ 435.617056] activate_task+0x57/0xc0
  [ 435.617059] ? sched_clock+0x9/0x10
  [ 435.617060] ? sched_clock+0x9/0x10
  [ 435.617061] ttwu_do_activate+0x49/0x90
  [ 435.617063] try_to_wake_up+0x1df/0x490
  [ 435.617065] default_wake_function+0x12/0x20
  [ 435.617067] autoremove_wake_function+0x12/0x40
  [ 435.617068] __wake_up_common+0x73/0x130
  [ 435.617069] __wake_up_common_lock+0x80/0xc0
  [ 435.617070] __wake_up+0x13/0x20
  [ 435.617073] __wbt_done.part.21+0xa4/0xb0
  [ 435.617074] wbt_done+0x72/0xa0
  [ 435.617077] blk_mq_free_request+0xca/0x1a0
  [ 435.617079] blk_mq_end_request+0x48/0x90
  [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 435.617088] __blk_mq_complete_request+0xd2/0x140
  [ 435.617090] blk_mq_complete_request+0x18/0x20
  [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 435.617093] nvme_irq+0x23/0x50 [nvme]
  [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
  [ 435.617097] handle_irq_event_percpu+0x32/0x80
  [ 435.617098] handle_irq_event+0x3b/0x60
  [ 435.617101] handle_edge_irq+0x7c/0x190
  [ 435.617102] handle_irq+0x20/0x30
  [ 435.617106] do_IRQ+0x46/0xd0
  [ 435.617107] common_interrupt+0x84/0x84
  [ 435.617108] </IRQ>
  ...
  [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
  [ 435.617118] cpuidle_enter+0x17/0x20
  [ 435.617119] call_cpuidle+0x23/0x40
  [ 435.617121] do_idle+0x18c/0x1f0
  [ 435.617122] cpu_startup_entry+0x73/0x80
  [ 435.617125] start_secondary+0x1a6/0x200
  [ 435.617127] secondary_startup_64+0xa5/0xb0
  [ 435.617128] Code: ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810998/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to