*** This bug is a duplicate of bug 1709889 ***
https://bugs.launchpad.net/bugs/1709889
------- Comment From [email protected] 2018-10-24 03:48 EDT-------
- - - Problem - - - -
I see hard lockup messages and hung I/O requests when running the HTX mdt.io
exerciser on SAS disks configured from the LSI9361 adapter.
uname -a
Linux ltc-boston1 4.15.0-26-generic #28-Ubuntu SMP Wed Jul 4 16:19:53 UTC 2018
ppc64le ppc64le ppc64le GNU/Linux
Machine: Power 9 Boston LC
Firmware: P9DSU-V1.16-20180531-imp
adapter: AVAGO MegaRAID SAS 9361-8i
firmware: 24.21.0-0025
HTX IO started on these devices:
252:2 2 JBOD - 3.637 TB SATA HDD N N 512B ST4000NM0024-1HT178 U
252:5 5 JBOD - 3.637 TB SATA HDD N N 512B ST4000NM0024-1HT178 U
252:6 6 JBOD - 3.637 TB SATA HDD N N 512B ST4000NM0024-1HT178 U
252:7 7 JBOD - 3.637 TB SATA HDD N N 512B ST4000NM0024-1HT178 U
After more than 12 hours of I/O stress, I see these trace messages in dmesg:
Watchdog CPU:95 Hard LOCKUP
Modules linked in: mpt3sas raid_class scsi_transport_sas xt_CHECKSUM
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink
ip6table_filter ip6_tables iptable_filter kvm_hv kvm joydev input_leds ofpart
cmdlinepart at24 uio_pdrv_genirq uio ipmi_powernv ipmi_devintf opal_prd
vmx_crypto powernv_flash mtd ipmi_msghandler ibmpowernv sch_fq_codel ib_iser
rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
libcrc32c raid1 raid0 multipath linear hid_generic usbhid
hid lpfc qla2xxx ast i2c_algo_bit drm_kms_helper nvmet_fc nvmet syscopyarea
sysfillrect nvme_fc sysimgblt fb_sys_fops nvme_fabrics ttm nvme_core
crct10dif_vpmsum crc32c_vpmsum drm megaraid_sas i40e scsi_transport_fc aacraid
CPU: 95 PID: 0 Comm: swapper/95 Not tainted 4.15.0-26-generic #28-Ubuntu
NIP: c00000000015fe7c LR: c000000000162e6c CTR: c000000000ac04f0
REGS: c00000003fb8bd80 TRAP: 0900 Not tainted (4.15.0-26-generic)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44022488 XER: 00000000
CFAR: c000000000162e68 SOFTE: 0
GPR00: c000000000162e6c c0002007ff3eb7b0 c0000000016eaf00 c0002007174d6c00
GPR04: c00020070663b180 0000000000000049 c0000000fccd7fb8 0000a5a2102a9058
GPR08: 0000000000000003 0000000000000041 0000000000000000 0000000000000000
GPR12: c000000000ac04f0 c00000000fac1500 c00020072736ff90 0000000000000000
GPR16: 0000000000000000 0000000000000100 0000000000000004 0000000000000028
GPR20: c000000001712220 0000000000200042 00000000000002f8 000020072d250000
GPR24: c0000000011d8580 c00000000171dd78 c00020072e428580 0000000000000000
GPR28: 0000000000000049 c00020072e428580 c0002007174d6c00 0000000000000000
NIP [c00000000015fe7c] update_curr+0x2c/0x2f0
LR [c000000000162e6c] enqueue_entity+0x5c/0xd90
Call Trace:
[c0002007ff3eb7b0] [c0002007ff3eb840] 0xc0002007ff3eb840 (unreliable)
[c0002007ff3eb840] [c000000000163074] enqueue_entity+0x264/0xd90
[c0002007ff3eb8f0] [c000000000163c50] enqueue_task_fair+0xb0/0x7c0
[c0002007ff3eb9c0] [c00000000014d998] activate_task+0x88/0x130
[c0002007ff3eba40] [c00000000014df30] ttwu_do_activate+0x70/0xd0
[c0002007ff3eba80] [c00000000014f400] try_to_wake_up+0x230/0x660
[c0002007ff3ebb00] [c000000000431a40] blkdev_bio_end_io_simple+0x30/0x50
[c0002007ff3ebb20] [c000000000679ed4] bio_endio+0x134/0x200
[c0002007ff3ebb60] [c000000000684ad0] blk_update_request+0xd0/0x4b0
[c0002007ff3ebbf0] [c0000000009073f0] scsi_end_request+0x50/0x270
[c0002007ff3ebc50] [c0000000009078c4] scsi_io_completion+0x2b4/0x750
[c0002007ff3ebd10] [c0000000008fbad8] scsi_finish_command+0x158/0x1b0
[c0002007ff3ebd90] [c000000000906a48] scsi_softirq_done+0x198/0x220
[c0002007ff3ebe20] [c000000000691fc8] blk_done_softirq+0xb8/0xe0
[c0002007ff3ebe60] [c000000000d02528] __do_softirq+0x158/0x3e4
[c0002007ff3ebf40] [c000000000116988] irq_exit+0xe8/0x120
[c0002007ff3ebf60] [c000000000017888] __do_irq+0x88/0x1c0
[c0002007ff3ebf90] [c00000000002a2d0] call_do_irq+0x14/0x24
[c00020072736fa90] [c000000000017a5c] do_IRQ+0x9c/0x130
[c00020072736fae0] [c000000000009bf4] h_virt_irq_common+0x114/0x120
--- interrupt: ea1 at replay_interrupt_return+0x0/0x4
LR = arch_local_irq_restore+0x74/0x90
[c00020072736fdd0] [000000000000005f] 0x5f (unreliable)
[c00020072736fdf0] [c000000000ac3d20] cpuidle_enter_state+0xf0/0x450
[c00020072736fe50] [c00000000017419c] call_cpuidle+0x4c/0x90
[c00020072736fe70] [c0000000001745b0] do_idle+0x2b0/0x330
[c00020072736fec0] [c00000000017486c] cpu_startup_entry+0x3c/0x50
[c00020072736fef0] [c00000000004a630] start_secondary+0x4f0/0x510
[c00020072736ff90] [c00000000000ab6c] start_secondary_prolog+0x10/0x14
Instruction dump:
60420000 3c4c0159 3842b0b0 7c0802a6 60000000 fba1ffe8 fbc1fff0 fbe1fff8
f821ff71 7c7e1b78 eba301b0 ebe30040 <813d0a58> 2b890001 409d024c 2fbf0000
Watchdog CPU:95 became unstuck
INFO: task hxestorage:19928 blocked for more than 120 seconds.
Not tainted 4.15.0-26-generic #28-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
hxestorage D 0 19928 6381 0x00040000
Call Trace:
[c0002007059df720] [c0002007059df770] 0xc0002007059df770 (unreliable)
[c0002007059df8f0] [c00000000001c320] __switch_to+0x2a0/0x4d0
[c0002007059df950] [c000000000cfab84] __schedule+0x2a4/0xaf0
[c0002007059dfa20] [c000000000cfb410] schedule+0x40/0xc0
[c0002007059dfa40] [c000000000152cbc] io_schedule+0x2c/0x50
[c0002007059dfa70] [c000000000431c34] __blkdev_direct_IO_simple+0x1d4/0x3e0
[c0002007059dfba0] [c0000000004321a0] blkdev_direct_IO+0x360/0x540
[c0002007059dfc70] [c0000000002e17ec] generic_file_read_iter+0xbc/0x210
[c0002007059dfcd0] [c000000000432e80] blkdev_read_iter+0x50/0x80
[c0002007059dfcf0] [c0000000003d1db0] new_sync_read+0x100/0x160
[c0002007059dfd80] [c0000000003d526c] vfs_read+0xbc/0x1b0
[c0002007059dfdd0] [c0000000003d5ae4] SyS_pread64+0xc4/0xf0
[c0002007059dfe30] [c00000000000b284] system_call+0x58/0x6c
INFO: task hxestorage:19930 blocked for more than 120 seconds.
Not tainted 4.15.0-26-generic #28-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
hxestorage D 0 19930 6381 0x00040000
Call Trace:
[c0002007059e3660] [c0002007059e36b0] 0xc0002007059e36b0 (unreliable)
[c0002007059e3830] [c00000000001c320] __switch_to+0x2a0/0x4d0
[c0002007059e3890] [c000000000cfab84] __schedule+0x2a4/0xaf0
[c0002007059e3960] [c000000000cfb410] schedule+0x40/0xc0
[c0002007059e3980] [c000000000152cbc] io_schedule+0x2c/0x50
[c0002007059e39b0] [c000000000431c34] __blkdev_direct_IO_simple+0x1d4/0x3e0
[c0002007059e3ae0] [c0000000004321a0] blkdev_direct_IO+0x360/0x540
[c0002007059e3bb0] [c0000000002e1a08] generic_file_direct_write+0xc8/0x240
[c0002007059e3c20] [c0000000002e1c8c] __generic_file_write_iter+0x10c/0x2a0
[c0002007059e3c80] [c0000000004336dc] blkdev_write_iter+0xac/0x160
[c0002007059e3cf0] [c0000000003d1f14] new_sync_write+0x104/0x160
[c0002007059e3d80] [c0000000003d5658] vfs_write+0xd8/0x220
[c0002007059e3dd0] [c0000000003d5bd4] SyS_pwrite64+0xc4/0xf0
[c0002007059e3e30] [c00000000000b284] system_call+0x58/0x6c
INFO: task hxestorage:19933 blocked for more than 120 seconds.
Not tainted 4.15.0-26-generic #28-Ubuntu
HTX error messages:
-----------------------------
Device id:/dev/sdm
Timestamp:Jul 8 17:13:48 2018
err=ffffffff
sev=4
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hung I/O alert! Segment table-0, Detected 2 I/O(s) hung.
Current time: 1531088028; hang criteria: 600 secs, Hard hang threshold: 3
Process ID: 0x4c32
1st LBA (hex) | Blocks (hex) | Kernel thread | Hang cnt | Duration (secs)
0x111caa00 200 71fee7e5f180 1 600
0x1812175 b9 71fedf7ef180 0 1
0x117130c 64 71fedefdf180 0 1
0x4658bda d7 71fec776f180 0 315
0xea8b2f0 110 71fec8f9f180 1 600
0x1929e10 1c 71fecafdf180 0 1
0x105d7f73 167 71feca7cf180 0 1
0xdeb615c 171 71fec3eff180 0 1
0x143bbde 17a 71fed17af180 0 1
0x10fe4601 cd 71fec7f7f180 0 1
0xfd7d286 1bd 71fec470f180 0 1
0xdef23c0 3a 71fec97af180 0 1
0xfda444c 38 71fede7cf180 0 1
0xbbbc2f8 1e2 71fed2fdf180 0 1
0x6232866 1ce 71fed0f9f180 0 1
0x17197e6 5a 71fec1ebf180 0 0
0x8f40f7c 1b3 71fec2edf180 0 347
---------------------------------------------------------------------
Device id:/dev/sdm
Timestamp:Jul 8 17:18:01 2018
err=ffffffff
sev=4
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hung I/O alert! Segment table-1, Detected 2 I/O(s) hung.
Current time: 1531088281; hang criteria: 600 secs, Hard hang threshold: 3
Process ID: 0x4c32
1st LBA (hex) | Blocks (hex) | Kernel thread | Hang cnt | Duration (secs)
0xfe0c6958 200 71fee6e3f180 1 853
0xbd316558 200 71fee764f180 1 760
0xfc361c6c 72 71fedd7af180 0 1
0xf7a98b6f 191 71fecafdf180 0 0
0xf4d05a00 7b 71fee560f180 0 0
0xf9c61d8e 1d4 71fed2fdf180 0 1
0xc5df7559 1f 71feca7cf180 0 1
0xf766b0b6 4c 71fec878f180 0 258
0xfcc2056e 8c 71fec4f1f180 0 1
0xe9ee8874 11a 71fec674f180 0 1
0xc64a1d0e 1e5 71fec6f5f180 0 1
0xebb99520 17a 71fecbfff180 0 1
0xc73f263b 4a 71fedffff180 0 1
0xf1f0b752 c3 71fed27cf180 0 1
---------------------------------------------------------------------
Note: IO continues after these messages.
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.05 0.02 22.69 0.00 77.24
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 138.15 10182.29 6589.58 2359613717 1527049245
sdk 137.52 10192.35 6621.95 2361945192 1534552335
sdl 138.26 10205.54 6613.76 2365000874 1532653682
sdm 138.65 10192.23 6585.01 2361917346 1525990241
-----------------
This is almost certainly the well-known issue with the CFQ scheduler, which
Ubuntu enables by default. Change the I/O scheduler to "deadline".
The scheduler can be set for individual disks, or system-wide with
the "elevator=deadline" boot parameter.
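
For the per-disk route, a minimal sketch (not taken from this report) of
checking and switching the scheduler through sysfs. It assumes the usual
/sys/block/<dev>/queue/scheduler file, whose content marks the active
scheduler in square brackets, e.g. "noop deadline [cfq]". The helper names
parse_schedulers and set_scheduler are hypothetical, and writing to sysfs
needs root.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split a sysfs scheduler line such as 'noop deadline [cfq]'.

    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def set_scheduler(device, wanted="deadline", sys_block="/sys/block"):
    """Select `wanted` for one disk and return the previously active scheduler.

    `device` is a name like 'sdm'. `sys_block` is a parameter only so the
    function can be tried against a scratch directory (an assumption of this
    sketch, not something the kernel requires).
    """
    path = Path(sys_block) / device / "queue" / "scheduler"
    active, available = parse_schedulers(path.read_text())
    if wanted not in available:
        raise ValueError(f"{wanted!r} not offered for {device}: {available}")
    if active != wanted:
        path.write_text(wanted)  # needs root on a real system
    return active
```

Calling set_scheduler("sdm") as root would switch /dev/sdm only; the setting
does not survive a reboot, which is why the boot parameter is the simpler
choice for a long stress run.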
--------------
I do not see hung I/O or any call traces with "elevator=deadline"
set on the kernel command line.
Regards,
Abdul
--
https://bugs.launchpad.net/bugs/1785081
Title:
[Ubuntu180401][bostonlc] HTX IO hung error with Watchdog CPU:95 Hard
LOCKUP trace messages (LSI9361/mpt3sas)
Status in The Ubuntu-power-systems project:
Triaged
Status in linux package in Ubuntu:
New
Bug description:
== Comment: #7 - Frederic Bonnard <[email protected]> - 2018-08-01 09:55:26
==
Mirroring this bug so that
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1709889 can be updated