Public bug reported:

I am using Ubuntu 14.04 (trusty) with the 4.4.x xenial kernel (with the trusty kernel it is much easier to make bcache crash).

I have mdadm raid1 on /boot and /, backed by 2 SSDs.
I have XFS on 12 ceph directories (/var/lib/ceph/osd/ceph-*), each of which is backed by bcache, which in turn is backed by one separate disk per osd directory, plus a bcache cache device on an NVMe PCIe device. The bcache cache is shared by all 12 of the osd bcache devices.

I also have 2 unused bcache cache devices on the SSDs, without mdadm raid. This hang problem was much more frequent with the cache there, and I suspected mdadm+bcache together, so I moved the cache to the NVMe. If I let the machines run for a few days and then detached and attached cache devices, it was very easy to hang a machine with the cache on the SSDs, but with the cache on the NVMe I haven't seen that yet.

The uptime on the machine that hung was 33-34 days, and the other ones with the same setup are now at 69 days 22h (counted from when I changed the cache to NVMe).

For the latest hang: when the machine hangs, the text tty at the local terminal shows the login prompt, but no stack trace or anything else, and typing into it has no effect, not even echoing what is typed. Terminals previously connected with ssh are just hung, not responding to anything. New ssh connections fail. Pinging the hung machine still gets replies. Soft shutdown via IPMI doesn't appear to do anything.

I will attach 2 files collected like: `ssh machine cat /dev/kmsg > cephX.kmsg` (since dmesg -w isn't supported here, and the logs do not get saved on the machine's disk since the IO system is hung).

----- Ubuntu bug reporting guidelines stuff -----

# lsb_release -rd
Description: Ubuntu 14.04.5 LTS
Release: 14.04

I am not including uname -a or apt-cache policy, since the kernel running now is different. It was linux-image-4.4.0-93-generic when it crashed (and the previous crash was with 4.4.0-78-generic).

You should also likely discard similar information from the apport collect data, which is from this boot, not the previously hung one.
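For anyone who wants to reproduce the log capture described above (`ssh machine cat /dev/kmsg > cephX.kmsg`), here is a minimal sketch of how it could be wrapped. The function name `capture_kmsg`, the default file name, and the keepalive values are my own assumptions for illustration, not something from the affected machines. Reading /dev/kmsg may need root depending on the system's settings.

```shell
#!/bin/sh
# Sketch only: capture_kmsg is a hypothetical helper, not part of any package.
# It streams a remote machine's kernel log into a local file, so the messages
# are kept even when the remote disks can no longer be written.
capture_kmsg() {
    host="$1"
    # Default output name is an assumption; pass a second argument to override.
    out="${2:-${host}-$(date +%Y%m%dT%H%M%S).kmsg}"
    # `cat /dev/kmsg` blocks and keeps following new kernel messages.
    # The ServerAlive options let the local ssh give up once the remote
    # sshd stops answering, instead of waiting forever.
    ssh -o ServerAliveInterval=15 -o ServerAliveCountMax=4 \
        "$host" cat /dev/kmsg >> "$out"
}

# Example (hostname is a placeholder), started in the background before a hang:
#   capture_kmsg ceph1 &
```

The capture has to be running before the hang happens, since new ssh connections fail once the machine is hung.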
----- debugging procedures stuff -----

https://help.ubuntu.com/community/DebuggingSystemCrash

It wants a memtest, but these machines were tested in the past, and it affects more than 2 machines, so that's not useful.

I'll try to remember to try Alt+SysRq+1,t next time.

I think the other sections are about getting dmesg output, which I have already, so I'll skip that.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.4.0-97-generic 4.4.0-97.120~14.04.1
ProcVersionSignature: Ubuntu 4.4.0-97.120~14.04.1-generic 4.4.87
Uname: Linux 4.4.0-97-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.25
Architecture: amd64
Date: Tue Oct 17 10:50:03 2017
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-lts-xenial
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-lts-xenial (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug trusty

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1724173

Title:
  bcache makes the whole io system hang after long run time

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-xenial/+bug/1724173/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs