** Description changed:

  I am using Ubuntu 14.04 (trusty) with the 4.4.x xenial kernel (bcache
  crashes far more easily with the trusty kernel). I have mdadm raid1 on
  /boot and /, backed by 2 SSDs.
  
  I have XFS on 12 ceph osd directories (/var/lib/ceph/osd/ceph-*), each
  on its own bcache device backed by one separate disk per osd directory.
  All 12 of the osd bcache devices share one bcache cache device on an
  NVMe PCIe device.
  
  I also have 2 unused bcache cache devices on the SSDs, without mdadm
  raid. The hang was much more frequent when the cache was there, and I
  suspected the combination of mdadm and bcache, so I moved the cache to
  the NVMe.
+ 
+ The problem happens on all these devices used as bcache: Micron
+ S630DC-400 (firmware M013 and M017), SAMSUNG MZ7KM480HMHQ-00005 (SM863a,
+ firmware GXM5004Q), Intel DC P3700 800GB.
  
  If I let the machines run for a few days and then detached and
  re-attached cache devices, it was very easy to hang a machine with the
  cache on the SSDs; with the cache on the NVMe, I haven't seen that yet.
  The uptime on the machine was 33-34 days, and the other ones with the
  same setup are now at 69 days 22h (that is how long ago I changed the
  cache to NVMe).
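The detach and attach step described above can be sketched through the bcache sysfs files. This is only a sketch under assumptions: the device name (`bcache0`) and the cache set UUID are placeholders, not values from this report, and the `sysfs_root` argument exists only so the functions can be tried against a scratch directory.

```shell
# Sketch only: detach a bcache backing device from its cache set and attach
# it again. Needs root on a real system. The bcache device name and the cache
# set UUID are hypothetical placeholders supplied by the caller.

# bcache_detach DEVICE [SYSFS_ROOT]
bcache_detach() {
    sysfs_root=${2:-/sys}
    echo 1 > "$sysfs_root/block/$1/bcache/detach"
}

# bcache_attach DEVICE CSET_UUID [SYSFS_ROOT]
bcache_attach() {
    sysfs_root=${3:-/sys}
    echo "$2" > "$sysfs_root/block/$1/bcache/attach"
}

# Example (not run here):
#   bcache_detach bcache0
#   bcache_attach bcache0 <cset-uuid as listed under /sys/fs/bcache/>
```

Detaching flushes any dirty data first, so the attach may need to wait until the detach has finished.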
  
  In the latest hang, the text tty at the local terminal showed the login
  prompt but no stack trace or anything else, and typing into it had no
  effect, not even echoing what was typed. Terminals previously connected
  with ssh were just hung, not responding to anything. New ssh connections
  failed. The hung machine still replied to ping. Soft shutdown via IPMI
  didn't appear to do anything.
  
  I will attach 2 files collected like: `ssh machine cat /dev/kmsg >
  cephX.kmsg` (since dmesg -w isn't supported here, and the logs do not
  get saved on the machine's disk since the IO system is hung).
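The capture command above could be wrapped so that several machines are logged at once. A minimal sketch, assuming hypothetical host names (`ceph1`, `ceph2`, ...) and a local output directory; the function names are made up for illustration.

```shell
# Sketch only: stream /dev/kmsg from several hosts into local files, so the
# last kernel messages before a hang survive on another machine.
# Host names and the output directory are hypothetical placeholders.

# kmsg_file_for HOST [OUTDIR]: print the local file name used for HOST.
kmsg_file_for() {
    printf '%s/%s.kmsg\n' "${2:-.}" "$1"
}

# capture_kmsg OUTDIR HOST...: start one background ssh per host. Each one
# keeps reading /dev/kmsg until the connection drops or is killed.
capture_kmsg() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    for host in "$@"; do
        # -n keeps the background ssh from reading the terminal's stdin.
        ssh -n "$host" cat /dev/kmsg > "$(kmsg_file_for "$host" "$outdir")" &
    done
}

# Example (not run here): capture_kmsg ./kmsg ceph1 ceph2 ceph3
```

Because the files are written on the machine running ssh, they stay readable even when the remote IO system is hung.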
  
  ----- Ubuntu bug reporting guidelines stuff -----
  
  # lsb_release -rd
  Description:    Ubuntu 14.04.5 LTS
  Release:        14.04
  
  I am not including uname -a or apt-cache policy output, since the
  kernel running now is different. It was linux-image-4.4.0-93-generic
  when it crashed (and the previous crash was with 4.4.0-78-generic).
  
  Also you should likely discard similar information from the apport
  collect data, which is from this boot, not the previously hung one.
  
  ----- debugging procedures stuff -----
  
  https://help.ubuntu.com/community/DebuggingSystemCrash
  
  It wants a memtest, but these machines were tested in the past, and the
  problem affects more than 2 machines, so that's unlikely to be useful.
  
  I'll try to remember to try Alt+SysRq+1,t next time.
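The keyboard SysRq sequence only helps if the kernel accepts the requested function, so one option is to enable SysRq ahead of time. A sketch under assumptions: it needs root and a shell that still responds, and the `proc_root` argument is there only so the functions can be tried against a scratch directory.

```shell
# Sketch only: enable all SysRq functions and request a task dump.
# On a real system both need root; proc_root defaults to /proc.

# enable_sysrq [PROC_ROOT]: allow every SysRq function.
enable_sysrq() {
    proc_root=${1:-/proc}
    echo 1 > "$proc_root/sys/kernel/sysrq"
}

# dump_tasks [PROC_ROOT]: same request as Alt+SysRq+t, issued from a shell.
dump_tasks() {
    proc_root=${1:-/proc}
    echo t > "$proc_root/sysrq-trigger"
}

# Example (not run here): enable_sysrq    # once after boot, before any hang
```

Once the machine hangs, the shell path is probably gone, so `enable_sysrq` would have to run beforehand; the task dump then goes to the kernel log, which the `/dev/kmsg` capture can pick up if it is still connected.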
  
  I think the other sections are about getting dmesg output, which I have
  already, so I'll skip that.
  
  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: linux-image-4.4.0-97-generic 4.4.0-97.120~14.04.1
  ProcVersionSignature: Ubuntu 4.4.0-97.120~14.04.1-generic 4.4.87
  Uname: Linux 4.4.0-97-generic x86_64
  ApportVersion: 2.14.1-0ubuntu3.25
  Architecture: amd64
  Date: Tue Oct 17 10:50:03 2017
  ProcEnviron:
-  TERM=xterm-256color
-  PATH=(custom, no user)
-  LANG=en_US.UTF-8
-  SHELL=/bin/bash
+  TERM=xterm-256color
+  PATH=(custom, no user)
+  LANG=en_US.UTF-8
+  SHELL=/bin/bash
  SourcePackage: linux-lts-xenial
  UpgradeStatus: No upgrade log present (probably fresh install)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1724173

Title:
  bcache makes the whole io system hang after long run time

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-xenial/+bug/1724173/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
