Public bug reported:

I am using Ubuntu 14.04 (trusty) with the 4.4.x xenial kernel (bcache
crashes much more easily with the trusty kernel). I have mdadm raid1 on
/boot and /, backed by 2 SSDs.

I have XFS on 12 ceph osd directories (/var/lib/ceph/osd/ceph-*). Each
one is backed by a bcache device, which in turn is backed by its own
separate disk, plus a bcache cache device on an NVMe PCIe device. The
bcache cache is shared by all 12 of the osd bcache devices.

I also have 2 unused bcache cache devices on the SSDs, without mdadm
raid. The hang was much more frequent when the cache was there, and I
suspected the combination of mdadm and bcache, so I moved the cache to
the NVMe.

When I let the machines run for a few days and then detached and
re-attached cache devices, it was very easy to hang them with the cache
on the SSDs. With the cache on the NVMe, I haven't seen that yet. The
uptime on the machine that hung was 33-34 days, and the other ones with
the same setup are now at 69 days 22h (that's when I changed the cache
to NVMe).
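
A detach/attach cycle of this kind can be sketched as follows. This is
only an illustrative sketch, not necessarily the exact commands used on
these machines: the function name `bcache_reattach` and the `SYSFS_ROOT`
variable are made up for the example, and it assumes the standard bcache
sysfs files (`detach`, `attach`, `state`) described in the kernel's
bcache documentation.

```shell
# Sketch: detach one bcache backing device from its cache set, wait for
# the detach to finish, then attach it again.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# bcache_reattach DEV CSET_UUID
#   DEV        bcache device name, e.g. bcache0
#   CSET_UUID  UUID of the cache set, as listed under /sys/fs/bcache/
bcache_reattach() {
    dev_dir="$SYSFS_ROOT/block/$1/bcache"
    [ -d "$dev_dir" ] || { echo "no such bcache device: $1" >&2; return 1; }

    # Writing to "detach" asks bcache to flush dirty data and detach.
    echo 1 > "$dev_dir/detach" || return 1

    # "state" reads "no cache" once the backing device is fully detached.
    while [ "$(cat "$dev_dir/state")" != "no cache" ]; do
        sleep 1
    done

    # Writing the cache set UUID to "attach" enables caching again.
    echo "$2" > "$dev_dir/attach"
}
```

It would be run as root, once per osd device, for example
`bcache_reattach bcache0 <cset-uuid>`. Note that the wait loop has no
timeout, so if the IO system hangs during the detach it will wait
forever.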

In the latest hang, the text tty at the local terminal shows the login
prompt, but no stack trace or anything else, and typing into it has no
effect; what is typed is not even echoed. Terminals previously connected
with ssh are just hung, not responding to anything. New ssh connections
fail. The hung machine still replies to ping. Soft shutdown via IPMI
doesn't appear to do anything.

I will attach 2 files collected like: `ssh machine cat /dev/kmsg >
cephX.kmsg` (since dmesg -w isn't supported here, and the logs do not
get saved on the machine's disk since the IO system is hung).
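
To keep such a capture running for several machines at once, something
like the following could be used. It is a minimal sketch: the function
name `collect_kmsg` and the host names in the usage line are
placeholders, not the real machine names.

```shell
# collect_kmsg HOST...
# Streams /dev/kmsg from each host into HOST.kmsg in the current
# directory, one background ssh per host.
collect_kmsg() {
    for host in "$@"; do
        # -n redirects stdin from /dev/null, which ssh needs when it
        # runs in the background.
        ssh -n "$host" cat /dev/kmsg > "$host.kmsg" &
    done
    # Block until every ssh session has ended.
    wait
}
```

Usage would be e.g. `collect_kmsg ceph1 ceph2`. Since `cat /dev/kmsg`
keeps following the log, this runs until the connections end or are
killed, and whatever was written before a hang stays in the local files.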

----- Ubuntu bug reporting guidelines stuff -----

# lsb_release -rd
Description:    Ubuntu 14.04.5 LTS
Release:        14.04

I am not including uname -a or apt-cache policy, since the kernel
running now is different. It was linux-image-4.4.0-93-generic when it
crashed (and the previous crash was with 4.4.0-78-generic).

Also you should likely discard similar information from the apport
collect data, which is from this boot, not the previously hung one.

----- debugging procedures stuff -----

https://help.ubuntu.com/community/DebuggingSystemCrash

It wants a memtest, but these machines were tested in the past, and the
problem affects more than 2 machines, so that's not useful.

I'll try to remember to try Alt+SysRq+1,t next time.

I think the other sections are about getting dmesg output, which I have
already, so I'll skip that.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.4.0-97-generic 4.4.0-97.120~14.04.1
ProcVersionSignature: Ubuntu 4.4.0-97.120~14.04.1-generic 4.4.87
Uname: Linux 4.4.0-97-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.25
Architecture: amd64
Date: Tue Oct 17 10:50:03 2017
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-lts-xenial
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-lts-xenial (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug trusty

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1724173

Title:
  bcache makes the whole io system hang after long run time
