------- Comment From jac...@de.ibm.com 2017-03-08 05:56 EDT------- Just tried the newest Kernel 4.4.0-66, and I'm still running into the hang. Here the final statements in /var/log/syslog (the lines, that never make it out onto the disk):
Mar 8 11:26:31 mclint multipathd[955]: mpatha: sdb - tur checker timed out Mar 8 11:26:31 mclint multipathd[955]: 8:16: reinstated Mar 8 11:26:31 mclint multipathd[955]: mpatha: sdd - tur checker timed out Mar 8 11:26:31 mclint rsyslogd-2007: action 'action 10' suspended, next retry is Wed Mar 8 11:27:01 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ] Mar 8 11:26:31 mclint multipathd[955]: 8:48: reinstated Mar 8 11:26:31 mclint multipathd[955]: mpatha: sdc - tur checker timed out Mar 8 11:26:31 mclint multipathd[955]: 8:32: reinstated Mar 8 11:26:32 mclint multipathd[955]: mpatha: sda - tur checker timed out Mar 8 11:26:32 mclint multipathd[955]: 8:0: reinstated And this here shows up on the sclp_line console: ? 961.419327! INFO: task cpuplugd:2604 blocked for more than 120 seconds. ? 961.419337! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419338! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419404! INFO: task irqbalance:2651 blocked for more than 120 seconds. ? 961.419406! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419407! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419450! INFO: task kworker/0:4:3801 blocked for more than 120 seconds. ? 961.419451! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419452! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419494! INFO: task kworker/1:1:4548 blocked for more than 120 seconds. ? 961.419495! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419496! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419539! INFO: task kworker/0:0H:20302 blocked for more than 120 seconds. ? 961.419540! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419541! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419764! INFO: task kworker/0:0:66641 blocked for more than 120 seconds. ? 961.419766! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419767! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 961.419895! INFO: task rm:81710 blocked for more than 120 seconds. ? 961.419896! Not tainted 4.4.0-66-generic #87-Ubuntu ? 961.419897! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 1081.419024! INFO: task systemd:1 blocked for more than 120 seconds. ? 1081.419033! Not tainted 4.4.0-66-generic #87-Ubuntu ? 1081.419035! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 1081.419148! INFO: task cpuplugd:2604 blocked for more than 120 seconds. ? 1081.419150! Not tainted 4.4.0-66-generic #87-Ubuntu ? 1081.419151! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ? 1081.419186! INFO: task irqbalance:2651 blocked for more than 120 seconds. ? 1081.419187! Not tainted 4.4.0-66-generic #87-Ubuntu ? 1081.419188! "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. I pulled a DASD-Dump from the system: KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-66-generic DUMPFILE: mclint_20170308_kernel_4_4_0-66_without_openafs.dump CPUS: 3 DATE: Wed Mar 8 11:37:56 2017 UPTIME: 00:25:30 LOAD AVERAGE: 12.99, 11.25, 6.55 TASKS: 422 NODENAME: mclint RELEASE: 4.4.0-66-generic VERSION: #87-Ubuntu SMP Fri Mar 3 15:32:53 UTC 2017 MACHINE: s390x (unknown Mhz) MEMORY: 7.8 GB PANIC: "" PID: 0 COMMAND: "swapper/0" TASK: bb1538 (1 of 3) [THREAD_INFO: b7c000] CPU: 0 STATE: TASK_RUNNING (ACTIVE) INFO: no panic task found And again I see 10 multipath-Daemons in the process list, this is my typical hang scenario. crash> ps | grep multipathd 955 1 0 1e49115f0 IN 0.1 335364 8316 multipathd 971 1 0 7e8b2be0 IN 0.1 335364 8316 multipathd 972 1 0 7e8b6db0 IN 0.1 335364 8316 multipathd 977 1 1 7e8b36d8 IN 0.1 335364 8316 multipathd 978 1 0 7e8b62b8 IN 0.1 335364 8316 multipathd 979 1 2 7e8b4cc8 IN 0.1 335364 8316 multipathd 81714 1 1 7cdc8000 IN 0.1 335364 8316 multipathd 81715 1 1 7cdc95f0 IN 0.1 335364 8316 multipathd 81716 1 1 7cdcc1d0 IN 0.1 335364 8316 multipathd 81717 1 1 1e6c595f0 IN 0.1 335364 8316 multipathd I'll compress the dump and try to find ways to make it available to you ... -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1670634 Title: blk-mq: possible deadlock on CPU hot(un)plug Status in Ubuntu on IBM z Systems: New Status in linux package in Ubuntu: New Bug description: == Comment: #0 - Carsten Jacobi <jac...@de.ibm.com> - 2017-03-07 03:35:31 == I'm evaluating Ubuntu-Xenial on z for development purposes, the test system is installed in an LPAR with one FCP-LUN which is accessable by 4 pathes (all pathes are configured). The system hangs regularly when I make packages with "pdebuild" using the pbuilder packaging suit. The local Linux development team helped me out with a pre-analysis that I can post here (thanks a lot for that): With the default settings and under a certain workload, blk_mq seems to get into a presumed "deadlock". Possibly this happens on CPU hot(un)plug. After the I/O stalled, a dump was pulled manually. The following information is from the crash dump pre-analysis. $ zgetdump -i dump.0 General dump info: Dump format........: elf Version............: 1 UTS node name......: mclint UTS kernel release.: 4.4.0-65-generic UTS kernel version.: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017 System arch........: s390x (64 bit) CPU count (online).: 2 Dump memory range..: 8192 MB Memory map: 0000000000000000 - 00000001b831afff (7043 MB) 00000001b831b000 - 00000001ffffffff (1149 MB) Things look similarly with HWE kernel ubuntu16.04-4.8.0-34.36~16.04.1. KERNEL: vmlinux.full DUMPFILE: dump.0 CPUS: 2 DATE: Fri Mar 3 14:31:07 2017 UPTIME: 02:11:20 LOAD AVERAGE: 13.00, 12.92, 11.37 TASKS: 411 NODENAME: mclint RELEASE: 4.4.0-65-generic VERSION: #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017 MACHINE: s390x (unknown Mhz) MEMORY: 7.8 GB PANIC: "" PID: 0 COMMAND: "swapper/0" TASK: bad528 (1 of 2) [THREAD_INFO: b78000] CPU: 0 STATE: TASK_RUNNING (ACTIVE) INFO: no panic task found crash> dev -d MAJOR GENDISK NAME REQUEST_QUEUE TOTAL ASYNC SYNC DRV ... 8 1e1d6d800 sda 1e1d51210 0 23151 4294944145 N/A(MQ) 8 1e4e06800 sdc 2081b18 0 23148 4294944148 N/A(MQ) 8 1f07800 sdb 20c7568 0 23195 4294944101 N/A(MQ) 8 1e4e06000 sdd 1e4e31210 0 23099 4294944197 N/A(MQ) 252 1e1d6c800 dm-0 1e1d51b18 9 1 8 N/A(MQ) ... So both dm-mpath and sd have requests pending in their block multiqueue. The large numbers of sd look strange and seem to be the unsigned formatting of the values shown for async multiplied by -1. [ 0.798256] Linux version 4.4.0-65-generic (buildd@z13-011) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #86-Ubuntu SMP Thu Feb 23 17:54:37 UTC 2017 (Ubuntu 4.4.0-65.86-generic 4.4.49) [ 0.798262] setup: Linux is running natively in 64-bit mode [ 0.798290] setup: Max memory size: 8192MB [ 0.798298] setup: Reserving 196MB of memory at 7996MB for crashkernel (System RAM: 7996MB) [ 0.836923] Kernel command line: root=/dev/mapper/mclint_vg-root rootflags=subvol=@ crashkernel=196M BOOT_IMAGE=0 [ 5281.179428] INFO: task xfsaild/dm-11:1604 blocked for more than 120 seconds. [ 5281.179437] Not tainted 4.4.0-65-generic #86-Ubuntu [ 5281.179438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.179440] xfsaild/dm-11 D 00000000007bcf52 0 1604 2 0x00000000 [ 5281.179444] 00000001e931c230 00000000001a6964 00000001e6f9b958 00000001e6f9b9d8 00000001e15795f0 00000001e6f9b988 0000000000ce8c00 00000001ea805c70 00000001ea805c00 0000000000ba5ed0 00000001e931c1d0 00000001e1579b20 00000001ea805c00 00000001e15795f0 00000001ea805c00 0000000000000000 00000000007d3978 00000000007bc9f8 00000001e6f9b9d8 00000001e6f9ba40 [ 5281.179454] Call Trace: [ 5281.179461] ([<00000000007bc9f8>] __schedule+0x300/0x810) [ 5281.179462] [<00000000007bcf52>] schedule+0x4a/0xb0 [ 5281.179465] [<00000000007c02aa>] schedule_timeout+0x232/0x2a8 [ 5281.179466] [<00000000007bde50>] wait_for_common+0x110/0x1c8 [ 5281.179472] [<000000000017b602>] flush_work+0x42/0x58 [ 5281.179564] [<000003ff805e14ba>] xlog_cil_force_lsn+0x7a/0x238 [xfs] [ 5281.179589] [<000003ff805dee82>] _xfs_log_force+0x9a/0x2e8 [xfs] [ 5281.179615] [<000003ff805df114>] xfs_log_force+0x44/0x100 [xfs] [ 5281.179640] [<000003ff805ec668>] xfsaild+0x170/0x798 [xfs] [ 5281.179643] [<000000000018335a>] kthread+0x10a/0x110 [ 5281.179645] [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc [ 5281.179646] [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc see below [ 5281.179664] INFO: task cpuplugd:2260 blocked for more than 120 seconds. [ 5281.179665] Not tainted 4.4.0-65-generic #86-Ubuntu [ 5281.179666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.179668] cpuplugd D 00000000007bcf52 0 2260 1 0x00000000 [ 5281.179670] 00000001e782e318 00000000001a6964 000000007bc4ba58 000000007bc4bad8 00000001d4076db0 000000007bc4ba88 0000000000ce8c00 00000001ea805c70 00000001ea805c00 0000000000ba5ed0 00000001e782e2b8 00000001d40772e0 00000001ea805c00 00000001d4076db0 00000001ea805c00 0000000000000000 00000000007d3978 00000000007bc9f8 000000007bc4bad8 000000007bc4bb40 [ 5281.179678] Call Trace: [ 5281.179680] ([<00000000007bc9f8>] __schedule+0x300/0x810) [ 5281.179681] [<00000000007bcf52>] schedule+0x4a/0xb0 [ 5281.179685] [<0000000000516cc2>] blk_mq_freeze_queue_wait+0x62/0xc8 [ 5281.179687] [<0000000000519412>] blk_mq_queue_reinit_notify+0x11a/0x240 [ 5281.179690] [<00000000001844c6>] notifier_call_chain+0x56/0x98 [ 5281.179692] [<000000000018466a>] __raw_notifier_call_chain+0x2a/0x38 [ 5281.179696] [<00000000001605ac>] _cpu_up+0x10c/0x1b0 [ 5281.179698] [<0000000000160738>] cpu_up+0xe8/0x108 [ 5281.179700] [<00000000005d08be>] cpu_subsys_online+0x56/0xb0 [ 5281.179703] [<00000000005ca1c2>] device_online+0x82/0xc0 [ 5281.179704] [<00000000005ca28a>] online_store+0x8a/0x98 [ 5281.179710] [<00000000003a4d12>] kernfs_fop_write+0x13a/0x190 [ 5281.179712] [<000000000031218c>] vfs_write+0x94/0x1a0 [ 5281.179714] [<0000000000312e9e>] SyS_write+0x66/0xd8 [ 5281.179715] [<00000000007c0e3e>] system_call+0xd6/0x264 [ 5281.179718] [<000003ff803df478>] zlib_tr_flush_block+0x650/0x830 [zlib_deflate] Cpuplugd performs CPU hot(un)plug based on configurable rules. https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_r_cpuplugdcmd.html https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_t_cpu_act.html https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ludd/ludd_r_numa_know_cpu.html [ 5281.179769] INFO: task kworker/0:2:23669 blocked for more than 120 seconds. [ 5281.179770] Not tainted 4.4.0-65-generic #86-Ubuntu [ 5281.179771] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.179773] kworker/0:2 D 00000000007bcf52 0 23669 2 0x00000000 [ 5281.179801] Workqueue: xfs-cil/dm-11 xlog_cil_push_work [xfs] [ 5281.179802] 00000001c925e318 00000000001a6964 0000000068c5f9f8 0000000068c5fa78 00000001da99b6d8 0000000068c5fa10 0000000000ce8c00 00000001ea805c70 00000001ea805c00 00000001e782e94c 00000001c925e2b8 00000001da99bc08 00000001ea805c00 00000001da99b6d8 00000001ea805c00 0000000000000000 00000000007d3978 00000000007bc9f8 0000000068c5fa78 0000000068c5fae0 [ 5281.179810] Call Trace: [ 5281.179812] ([<00000000007bc9f8>] __schedule+0x300/0x810) [ 5281.179813] [<00000000007bcf52>] schedule+0x4a/0xb0 [ 5281.179839] [<000003ff805de144>] xlog_state_get_iclog_space+0x124/0x338 [xfs] [ 5281.179864] [<000003ff805de702>] xlog_write+0x1ea/0x800 [xfs] [ 5281.179890] [<000003ff805e09a6>] xlog_cil_push+0x286/0x508 [xfs] [ 5281.179891] [<000000000017c400>] process_one_work+0x1a0/0x4f8 [ 5281.179893] [<000000000017c7a2>] worker_thread+0x4a/0x530 [ 5281.179894] [<000000000018335a>] kthread+0x10a/0x110 [ 5281.179896] [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc [ 5281.179898] [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc While above kworker executes a work item for a long duration, other processes block in turn on flush_work for a long duration. [ 5281.179728] INFO: task kworker/0:1:4454 blocked for more than 120 seconds. [ 5281.179730] Not tainted 4.4.0-65-generic #86-Ubuntu [ 5281.179731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.179732] kworker/0:1 D 00000000007bcf52 0 4454 2 0x00000000 [ 5281.179738] Workqueue: events vmstat_shepherd [ 5281.179739] 00000001c925ac40 00000000001a6964 000000007eb8bb38 000000007eb8bbb8 00000001e782e2b8 000000007eb8bb50 0000000000ce8c00 00000001ea805c70 00000001c925abe0 00000001c925b274 00000001c925abe0 00000001e782e7e8 00000001ea805c00 00000001e782e2b8 00000001ea805c00 0000000000000000 00000000007d3978 00000000007bc9f8 000000007eb8bbb8 000000007eb8bc20 [ 5281.179747] Call Trace: [ 5281.179749] ([<00000000007bc9f8>] __schedule+0x300/0x810) [ 5281.179750] [<00000000007bcf52>] schedule+0x4a/0xb0 [ 5281.179752] [<00000000007bd39a>] schedule_preempt_disabled+0x2a/0x38 [ 5281.179753] [<00000000007becc4>] __mutex_lock_slowpath+0xcc/0x170 [ 5281.179755] [<00000000007bedc6>] mutex_lock+0x5e/0x78 [ 5281.179756] [<000000000015fba0>] get_online_cpus+0x40/0x68 [ 5281.179757] [<00000000002a3ccc>] vmstat_shepherd+0x44/0x168 [ 5281.179759] [<000000000017c400>] process_one_work+0x1a0/0x4f8 [ 5281.179761] [<000000000017c7a2>] worker_thread+0x4a/0x530 [ 5281.179762] [<000000000018335a>] kthread+0x10a/0x110 [ 5281.179764] [<00000000007c0ff6>] kernel_thread_starter+0x6/0xc [ 5281.179765] [<00000000007c0ff0>] kernel_thread_starter+0x0/0xc This work item cannot progress maybe because cpuplugd:2260 above "hangs" in the cpu hotplug notifier chain. The low level device driver (here zfcp) is completely idle without any pending I/O after the lockup happened and all its paths are in good state and could service I/O but it simply does not get any new I/O requests from the upper layers (scsi / block). Zfcp does not implement blk_mq so dm or scsi translate which works in general but fails with above workload. There were no other undesired events, i.e. no path interruptions nor any recovery in zfcp. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu-z-systems/+bug/1670634/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp