> Is there any update on this issue?

I'm still debugging, but can give a summary of the problem, at least on
the system I'm working on.

First, to address:
> One is the vfree WARNING which indicates that the error paths are not quite 
> right

this is a trival fix to the error handling path; it's fixed upstream by
commit f58944e265d4ebe47216a5d7488aee3928823d30 ('NVMe: Simplify device
reset failure'), or alternately can be fixed more simply by removing the
unmap call from the error path, since the later removal also unmaps
(which is where the error comes from, unmapping already-unmapped
memory).  I'll SRU a fix for this, but of course this is just a minor
annoyance since the real problem is why the device timed out during
probing.

> why the NVMe drive failed to function correctly - which is the primary
issue for this test case

the problem here, in debugging on the system i have access to at least,
is the nvme driver isn't able to obtain the interrupt for the nvme
controller admin queue, and so the requests send to the admin queue (to
configure the controller) are never completed, and eventually time out,
leading the aborting the device probing.

The reason this used to work (e.g. with stock 4.4.x) is the nvme driver
used to use a kthread to poll the admin queue and complete any work
directly.  The kthread polling was removed upstream with commit
79f2b358c9ba373943a9284be2861fde58291c4e ('nvme: don't poll the CQ from
the kthread') which was pulled into the xenial kernel at 4.4.0-35 - but,
probe-time kthread polling was removed by upstream commit
7385014c073263b077442439299fad013edd4409 ('nvme: only add a controller
to dev_list after it's been fully initialized') which came into xenial
very early at 4.4.0-3.  So if the irq doesn't work, the nvme probing
times out.

As to why the nvme driver can't obtain the irq, I can't determine yet,
which is what i'm still debugging.  If you have a test system I can
access, I can debug there as well, or at least provide a debug kernel to
try (but i'm still adding more debug).  I'm not working on a ppc system,
so it's possible something else is going on there.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1639920

Title:
  NVMe detection failed during bootup

Status in linux package in Ubuntu:
  Triaged

Bug description:
  I've been running an on-off test on a couple of Power8 systems and
  have been getting a failure of detection on the NVMe drives on Ubuntu
  16.04.1 only.  I've ran the same test on RHEL 7.2 and have not
  encountered this proble. Once the problem occurs the OS will stop to
  boot up and a message appears:

  Welcome to emergency mode! After logging in, type "journalctl -xb" to view 
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to try 
again to boot into default mode.
  Give root password for maintenance (or press Control-D to continue):

  ProblemType: Crash
  DistroRelease: Ubuntu 16.04
  Package: apport 2.20.1-0ubuntu2.1
  ProcVersionSignature: Ubuntu 4.4.0-45.66-generic 4.4.21
  Uname: Linux 4.4.0-45-generic ppc64le
  ApportVersion: 2.20.1-0ubuntu2.1
  Architecture: ppc64el
  CrashReports:
   640:0:116:12295:2016-11-07 11:11:27.995107650 -0800:2016-11-07 
11:31:31.733627559 -0800:/var/crash/_usr_bin_apport-bug.0.crash
   644:0:116:0:2016-11-07 11:11:28.931104586 -0800:2016-11-07 
11:11:28.931104586 -0800:/var/crash/_usr_bin_apport-cli.0.None.hanging
  Date: Mon Nov  7 11:11:28 2016
  ExecutablePath: /usr/bin/apport-bug
  InstallationDate: Installed on 2016-11-05 (2 days ago)
  InstallationMedia: Ubuntu-Server 16.04.1 LTS "Xenial Xerus" - Release ppc64el 
(20160719)
  InterpreterPath: /usr/bin/python3.5
  PackageArchitecture: all
  ProcCmdline: /usr/bin/python3 /usr/bin/apport-cli --hanging
  ProcEnviron:
   TERM=linux
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcLoadAvg: 0.30 0.41 0.19 1/1132 2428
  ProcLocks:
   
  ProcSwaps:
   Filename                             Type            Size    Used    Priority
   /dev/sda3                               partition    157914048       0       
-1
  ProcVersion: Linux version 4.4.0-45-generic (buildd@bos01-ppc64el-030) (gcc 
version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #66-Ubuntu SMP Wed 
Oct 19 14:13:11 UTC 2016
  PythonArgs: ['/usr/bin/ubuntu-bug', '--hanging']
  SourcePackage: apport
  Title: apport-bug crashed with TypeError in run_hang(): int() argument must 
be a string, a bytes-like object or a number, not 'NoneType'
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
   
  cpu_cores: Number of cores present = 20
  cpu_coreson: Number of cores online = 20
  cpu_smt: SMT=8
  mtime.conffile..etc.apport.crashdb.conf: 2016-11-07T10:38:18.528739

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1639920/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to