We encountered an instance that hit an NVMe failure very early in boot today. I've updated our internal Canonical case as well as our Amazon case, but I'm posting the relevant details here too for consistency:
# uname -a
Linux XXX 4.4.0-1069-aws #79-Ubuntu SMP Mon Sep 24 15:01:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"

# echo type $EC2_INSTANCE_TYPE
type m5.xlarge

# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0    0  10G  0 disk /

# ls -al /dev/nvme* /dev/xvd* /dev/sd*
ls: cannot access '/dev/xvd*': No such file or directory
crw------- 1 root root 248, 0 Oct 31 15:02 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Oct 31 15:02 /dev/nvme0n1
lrwxrwxrwx 1 root root      7 Oct 31 15:02 /dev/sda1 -> nvme0n1

# dmesg | grep '63\.'
[   63.401466] nvme 0000:00:1f.0: I/O 0 QID 0 timeout, disable controller
[   63.505790] nvme 0000:00:1f.0: Cancelling I/O 0 QID 0
[   63.505812] nvme 0000:00:1f.0: Identify Controller failed (-4)
[   63.507536] nvme 0000:00:1f.0: Removing after probe failure
[   63.507604] iounmap: bad address ffffc90001b40000
[   63.508941] CPU: 1 PID: 351 Comm: kworker/1:3 Tainted: P           O    4.4.0-1069-aws #79-Ubuntu
[   63.508943] Hardware name: Amazon EC2 m5.xlarge/, BIOS 1.0 10/16/2017
[   63.508948] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[   63.508950]  0000000000000286 3501e2639044a4d2 ffff8800372bfce0 ffffffff923ffe03
[   63.508952]  ffff88040dd878f0 ffffc90001b40000 ffff8800372bfd00 ffffffff9206d3af
[   63.508954]  ffff88040dd878f0 ffff88040dd87a58 ffff8800372bfd10 ffffffff9206d3ec
[   63.508956] Call Trace:
[   63.508961]  [<ffffffff923ffe03>] dump_stack+0x63/0x90
[   63.508965]  [<ffffffff9206d3af>] iounmap.part.1+0x7f/0x90
[   63.508967]  [<ffffffff9206d3ec>] iounmap+0x2c/0x30
[   63.508969]  [<ffffffffc039abfa>] nvme_dev_unmap.isra.35+0x1a/0x30 [nvme]
[   63.508972]  [<ffffffffc039bd1e>] nvme_remove+0xce/0xe0 [nvme]
[   63.508976]  [<ffffffff92441e0e>] pci_device_remove+0x3e/0xc0
[   63.508980]  [<ffffffff9254f654>] __device_release_driver+0xa4/0x150
[   63.508982]  [<ffffffff9254f723>] device_release_driver+0x23/0x30
[   63.508986]  [<ffffffff9243abda>] pci_stop_bus_device+0x7a/0xa0
[   63.508988]  [<ffffffff9243ad3a>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[   63.508990]  [<ffffffffc039a62c>] nvme_remove_dead_ctrl_work+0x3c/0x50 [nvme]
[   63.508994]  [<ffffffff9209d86b>] process_one_work+0x16b/0x490
[   63.508996]  [<ffffffff9209dbdb>] worker_thread+0x4b/0x4d0
[   63.508998]  [<ffffffff9209db90>] ? process_one_work+0x490/0x490
[   63.509001]  [<ffffffff920a3e47>] kthread+0xe7/0x100
[   63.509005]  [<ffffffff92823301>] ? __schedule+0x301/0x7f0
[   63.509007]  [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[   63.509009]  [<ffffffff92827e35>] ret_from_fork+0x55/0x80
[   63.509011]  [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[   63.509013] Trying to free nonexistent resource <00000000febf8000-00000000febfbfff>

# modinfo nvme
filename:       /lib/modules/4.4.0-1069-aws/kernel/drivers/nvme/host/nvme.ko
version:        1.0
license:        GPL
author:         Matthew Wilcox <wi...@linux.intel.com>
srcversion:     5CF522443B009A8675C497B
alias:          pci:v0000106Bd00002001sv*sd*bc*sc*i*
alias:          pci:v*d*sv*sd*bc01sc08i02*
alias:          pci:v0000144Dd0000A822sv*sd*bc*sc*i*
alias:          pci:v0000144Dd0000A821sv*sd*bc*sc*i*
alias:          pci:v00001C58d00000003sv*sd*bc*sc*i*
alias:          pci:v00008086d00005845sv*sd*bc*sc*i*
alias:          pci:v00008086d0000F1A5sv*sd*bc*sc*i*
alias:          pci:v00008086d00000953sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
vermagic:       4.4.0-1069-aws SMP mod_unload modversions retpoline
parm:           admin_timeout:timeout in seconds for admin commands (uint)
parm:           io_timeout:timeout in seconds for I/O (uint)
parm:           shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm:           use_threaded_interrupts:int
parm:           use_cmb_sqes:use controller's memory buffer for I/O SQes (bool)
parm:           nvme_major:int
parm:           nvme_char_major:int
parm:           default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)

# systool -m nvme -va
Module = "nvme"

  Attributes:
    coresize            = "65536"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "1"
    srcversion          = "5CF522443B009A8675C497B"
    taint               = ""
    uevent              = <store method only>
    version             = "1.0"

  Parameters:
    admin_timeout       = "60"
    default_ps_max_latency_us= "100000"
    io_timeout          = "4294967295"
    shutdown_timeout    = "5"
    use_cmb_sqes        = "Y"

  Sections:
    .bss                = "0xffffffffc03a3780"
    .data               = "0xffffffffc03a3000"
    .data.unlikely      = "0xffffffffc03a33d8"
    .exit.text          = "0xffffffffc03a0cea"
    .gnu.linkonce.this_module= "0xffffffffc03a3400"
    .init.text          = "0xffffffffc03a8000"
    .note.gnu.build-id  = "0xffffffffc03a1000"
    .parainstructions   = "0xffffffffc03a1b88"
    .rodata             = "0xffffffffc03a1060"
    .rodata.str1.1      = "0xffffffffc03a2349"
    .rodata.str1.8      = "0xffffffffc03a1d78"
    .smp_locks          = "0xffffffffc03a1b28"
    .strtab             = "0xffffffffc03abb08"
    .symtab             = "0xffffffffc03a9000"
    .text               = "0xffffffffc0397000"
    __bug_table         = "0xffffffffc03a2be0"
    __kcrctab_gpl       = "0xffffffffc03a1040"
    __ksymtab_gpl       = "0xffffffffc03a1030"
    __ksymtab_strings   = "0xffffffffc03a25d3"
    __mcount_loc        = "0xffffffffc03a2730"
    __param             = "0xffffffffc03a25f0"

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1788035

Title:
  nvme: avoid cqe corruption

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  Fix Released

Bug description:
  To address a customer-reported NVMe issue with instance types (notably
  c5 and m5) that expose EBS volumes as NVMe devices, this commit from
  mainline v4.6 should be backported to Xenial:

  d783e0bd02e700e7a893ef4fa71c69438ac1c276 nvme: avoid cqe corruption when update at the same time as read

  dmesg sample:

  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, aborting
  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 9 QID 1 timeout, aborting
  [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 21 QID 2 timeout, aborting
  [Wed Aug 15 01:11:32 2018] nvme 0000:00:1f.0: I/O 10 QID 1 timeout, aborting
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, reset controller
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 21 QID 2
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 2
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 2
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 8 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 9 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 41943136
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 10 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 6976
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 24 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 25 QID 1
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 2 QID 0
  [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x7
  [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: fffffffc
  [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 96
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000687 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Log I/O Error Detected. Shutting down filesystem
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Please umount the filesystem and rectify the problem(s)
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 872, lost async page write
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_iunlink_remove: xfs_imap_to_bp returned error -5.
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 873, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 874, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 875, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 876, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 877, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 878, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 879, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 880, lost async page write
  [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 881, lost async page write
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000697 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000699 ("xlog_iodone") error 5 numblks 64
  [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
  [Wed Aug 15 01:12:20 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 23 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 25 QID 1 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 2 timeout, aborting
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
  [Wed Aug 15 01:12:50 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:12:52 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, reset controller
  [Wed Aug 15 01:13:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:13:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
  [Wed Aug 15 01:14:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788035/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
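For anyone triaging similar instances, the boot-time failure in this report has a distinctive shape: an admin-queue (QID 0) timeout that disables the controller, followed by the driver removing the device. A minimal sketch of checking a saved console or dmesg log for that sequence — the function name is ours, not from any existing tool, and the sample lines are taken from the report above:

```shell
# Sketch: detect the NVMe controller-death signature from this report
# (admin-queue timeout with "disable controller", then removal after
# probe failure) in a saved console or dmesg log file.
has_nvme_death_signature() {
    grep -q 'QID 0 timeout, disable controller' "$1" &&
    grep -q 'Removing after probe failure' "$1"
}

# Sample lines copied from the dmesg output in this report:
cat > /tmp/console.log <<'EOF'
[   63.401466] nvme 0000:00:1f.0: I/O 0 QID 0 timeout, disable controller
[   63.505812] nvme 0000:00:1f.0: Identify Controller failed (-4)
[   63.507536] nvme 0000:00:1f.0: Removing after probe failure
EOF

if has_nvme_death_signature /tmp/console.log; then
    echo "controller-death signature present"
fi
```

Ordinary I/O timeouts on the data queues (QID 1 and up, as in the bug description's sample) deliberately do not trip this check, since those can occur without the controller being torn down.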