Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?
Would it be possible for you to test the latest upstream kernel? Refer to
https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.17
kernel[0]. If this bug is fixed in the mainline kernel, please add the
following tag: 'kernel-fixed-upstream'. If the mainline kernel does not fix
this bug, please add the tag: 'kernel-bug-exists-upstream'. Once testing of
the upstream kernel is complete, please mark this bug as "Confirmed". Thanks
in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17

** Tags added: artful kernel-da-key

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776159

Title:
  mdadm raid soft lock-ups ubuntu kernel 4.13.0-36

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  We're running Ubuntu 16.04.4, mdadm v3.3 and kernel 4.13.0-36 (Ubuntu
  package linux-image-generic-hwe-16.04). We have created a RAID10 array
  from 22 960GB SSDs [1].

  The problem we're experiencing is that /usr/share/mdadm/checkarray
  (executed by cron, shipped in the mdadm package) results in a (soft?)
  deadlock: load on the node spikes up to 500-700 and all I/O operations
  are blocked for a period of time. We can see traces like these [2] in
  our kernel log. The array ends up stuck in a state like this:

  test@os-node1:~$ cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
  md1 : active raid10 dm-23[9] dm-22[8] dm-21[7] dm-20[6] dm-18[4] dm-19[5] dm-17[3] dm-16[21] dm-15[20] dm-14[2] dm-13[19] dm-12[18] dm-11[17] dm-10[16] dm-9[15] dm-8[14] dm-7[13] dm-6[12] dm-5[11] dm-4[10] dm-3[1] dm-2[0]
        10313171968 blocks super 1.2 512K chunks 2 near-copies [22/22] [UUUUUUUUUUUUUUUUUUUUUU]
        [===>.................]  check = 19.0% (1965748032/10313171968) finish=1034728.8min speed=134K/sec
        bitmap: 0/39 pages [0KB], 131072KB chunk

  unused devices: <none>

  and the only solution is to hard-reboot the node.

  What we found out is that it does not happen on an idle array; we have
  to generate significant load (10 VMs with 500GB HDDs running fio [3])
  to be able to reproduce the issue.

  Has anyone experienced similar issues? Do you have any suggestions on
  how to better troubleshoot this issue and identify whether the disks
  or the software layer is responsible for this behavior?

  [1] http://www.samsung.com/us/dell/pdfs/PM1633a_Flyer_2016_v4.pdf
  [2] https://gist.github.com/haad/09213bab1bc30a00c7d255c0bc60897b
  [3] https://github.com/axboe/fio
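
A minimal sketch of how the running check can be inspected and cancelled,
assuming the array is md1 as in the /proc/mdstat output above; these are the
standard md sysfs and checkarray interfaces rather than anything taken from
this report, and if the node is already in the blocked-I/O state the commands
may themselves hang:

  # Current md action (check/resync/idle) and current rate, useful to
  # confirm the ~134K/sec stall seen in /proc/mdstat:
  cat /sys/block/md1/md/sync_action
  cat /sys/block/md1/md/sync_speed

  # Ask the same script that cron runs to cancel the running check:
  sudo /usr/share/mdadm/checkarray --cancel /dev/md1

  # Equivalent low-level cancel via the md sysfs interface:
  echo idle | sudo tee /sys/block/md1/md/sync_action

On Debian/Ubuntu the cron job in question is typically /etc/cron.d/mdadm,
which runs "/usr/share/mdadm/checkarray --cron --all --idle --quiet" on the
first Sunday of the month, so the lock-up can also be reproduced on demand by
starting a check manually with "/usr/share/mdadm/checkarray /dev/md1" while
the fio load is running.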