We have a server with eight Samsung 980 Pro PCIe 4.0 NVMe SSDs (model
MZ-V8P1T0BW) running Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic
x86_64). We've seen this failure at least five times over the past
month, and not always on the same SSD. We first saw it on 5.4.0-81.
Some samples from dmesg are below.

This is a production system that runs a set of virtual desktop
instances. Thankfully the SSDs are in a ZFS pool of four two-way mirror
(RAID 1) vdevs, so the only outage we've had so far was when the failure
hit both members of a mirrored pair. After a reboot the SSDs come back
up.

[Mon Sep  6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
[Mon Sep  6 12:58:37 2021] nvme nvme5: I/O 133 QID 46 timeout, aborting
[Mon Sep  6 12:58:39 2021] nvme nvme5: I/O 134 QID 46 timeout, aborting
[Mon Sep  6 12:58:40 2021] nvme nvme5: I/O 135 QID 46 timeout, aborting
[Mon Sep  6 12:58:40 2021] nvme nvme5: I/O 784 QID 48 timeout, aborting
[Mon Sep  6 12:58:41 2021] nvme nvme5: I/O 136 QID 46 timeout, aborting
[Mon Sep  6 12:58:41 2021] nvme nvme5: I/O 137 QID 46 timeout, aborting
[Mon Sep  6 12:58:42 2021] nvme nvme5: I/O 492 QID 28 timeout, aborting
[Mon Sep  6 12:59:07 2021] nvme nvme5: I/O 132 QID 46 timeout, reset controller
[Mon Sep  6 12:59:38 2021] nvme nvme5: I/O 24 QID 0 timeout, reset controller
[Mon Sep  6 13:00:29 2021] nvme nvme5: Device not ready; aborting reset
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep  6 13:00:33 2021] INFO: task txg_quiesce:2172 blocked for more than 120 seconds.
[Mon Sep  6 13:00:33 2021]       Tainted: P           OE     5.4.0-81-generic #91-Ubuntu

[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting
[Tue Sep 21 21:18:37 2021] nvme nvme2: I/O 240 QID 26 timeout, aborting
[Tue Sep 21 21:18:47 2021] nvme nvme2: I/O 718 QID 23 timeout, aborting
[Tue Sep 21 21:18:56 2021] nvme nvme2: I/O 719 QID 23 timeout, aborting
[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller
[Tue Sep 21 21:19:37 2021] nvme nvme2: I/O 17 QID 0 timeout, reset controller
[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:47 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19
[Tue Sep 21 21:21:08 2021] nvme nvme2: Device not ready; aborting reset

[Tue Oct  5 16:54:59 2021] nvme nvme6: I/O 1013 QID 38 timeout, aborting
[Tue Oct  5 16:54:59 2021] nvme nvme6: I/O 727 QID 39 timeout, aborting
[Tue Oct  5 16:55:03 2021] nvme nvme6: I/O 1014 QID 38 timeout, aborting
[Tue Oct  5 16:55:05 2021] nvme nvme6: I/O 1015 QID 38 timeout, aborting
[Tue Oct  5 16:55:25 2021] nvme nvme6: I/O 15 QID 21 timeout, aborting
[Tue Oct  5 16:55:25 2021] nvme nvme6: I/O 408 QID 37 timeout, aborting
[Tue Oct  5 16:55:29 2021] nvme nvme6: I/O 1013 QID 38 timeout, reset controller
[Tue Oct  5 16:55:59 2021] nvme nvme6: I/O 11 QID 0 timeout, reset controller
[Tue Oct  5 16:56:51 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct  5 16:57:11 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:57:11 2021] nvme nvme6: Removing after probe failure status: -19
[Tue Oct  5 16:57:32 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct  5 16:57:32 2021] blk_update_request: I/O error, dev nvme6n1, sector 842198232 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0

[Mon Oct 11 12:14:38 2021] nvme nvme2: I/O 306 QID 48 timeout, aborting
[Mon Oct 11 12:14:39 2021] nvme nvme2: I/O 827 QID 14 timeout, aborting
[Mon Oct 11 12:15:01 2021] nvme nvme2: I/O 828 QID 14 timeout, aborting
[Mon Oct 11 12:15:05 2021] nvme nvme2: I/O 829 QID 14 timeout, aborting
[Mon Oct 11 12:15:07 2021] nvme nvme2: I/O 830 QID 14 timeout, aborting
[Mon Oct 11 12:15:08 2021] nvme nvme2: I/O 306 QID 48 timeout, reset controller
[Mon Oct 11 12:15:38 2021] nvme nvme2: I/O 20 QID 0 timeout, reset controller
[Mon Oct 11 12:16:29 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:50 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:50 2021] nvme nvme2: Removing after probe failure status: -19
[Mon Oct 11 12:17:10 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 1159355592 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 992254136 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 0
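
Every incident above follows the same sequence: I/O timeouts with aborts,
a controller reset, "Device not ready", and finally removal of the
device. To compare incidents across controllers, the lines could be
tallied with something like the sketch below. The names summarize and
NVME_LINE and the event labels are made up for illustration and are not
part of any kernel tooling; the sample lines are copied from the logs in
this report.

```python
import re
from collections import Counter

# Matches the nvme driver messages in both dmesg and journal format, e.g.
#   [Mon Sep  6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
#   Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting
NVME_LINE = re.compile(r"\bnvme (nvme\d+): (.*)$")


def summarize(lines):
    """Count event kinds per NVMe controller (labels are illustrative)."""
    counts = Counter()
    for line in lines:
        m = NVME_LINE.search(line.rstrip())
        if not m:
            continue  # not an nvme driver message (e.g. blk_update_request)
        ctrl, msg = m.groups()
        if msg.endswith("timeout, aborting"):
            kind = "io_timeout_abort"
        elif msg.endswith("timeout, reset controller"):
            kind = "controller_reset"
        elif msg.startswith("Device not ready"):
            kind = "reset_failed"
        elif msg.startswith("Removing after probe failure"):
            kind = "removed"
        else:
            kind = "other"
        counts[(ctrl, kind)] += 1
    return counts


sample = [
    "[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting",
    "[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller",
    "[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset",
    "[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371",
    "[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19",
    "Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting",
]

if __name__ == "__main__":
    for (ctrl, kind), n in sorted(summarize(sample).items()):
        print(ctrl, kind, n)
```

In practice the input would be the saved output of dmesg -T or
journalctl -k rather than a hard-coded list.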

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1910866

Title:
  nvme drive fails after some time

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Groovy:
  Fix Released
Status in Debian:
  New

Bug description:
  Sorry for the vague title. I thought this was a hardware issue until
  someone else online mentioned their nvme drive goes "read only" after
  some time. I tend not to reboot my system much, so have a large
  journal. Either way this happens once in a while. The / drive is fine,
  but /home is on nvme which just disappears. I reboot and everything is
  fine. But leave it long enough and it'll fail again.

  Here's the most recent snippet about the nvme drive before I restarted
  the system.

  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 449 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 450 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 451 QID 5 timeout, aborting
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, reset controller
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 22 QID 0 timeout, reset controller
  Jan 08 19:21:04 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:25 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:25 robot kernel: nvme nvme1: Removing after probe failure status: -19
  Jan 08 19:21:41 robot kernel: INFO: task jbd2/nvme1n1p1-:731 blocked for more than 120 seconds.
  Jan 08 19:21:41 robot kernel: jbd2/nvme1n1p1- D    0   731      2 0x00004000
  Jan 08 19:21:45 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993784 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123967, lost async page write
  Jan 08 19:21:45 robot kernel: EXT4-fs error (device nvme1n1p1): __ext4_find_entry:1535: inode #57278595: comm gsd-print-notif: reading directory lblock 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993384 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123917, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993320 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1833166472 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123909, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1909398624 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
  Jan 08 19:21:45 robot kernel: EXT4-fs (nvme1n1p1): I/O error while writing superblock

  ProblemType: Bug
  DistroRelease: Ubuntu 20.10
  Package: linux-image-5.8.0-34-generic 5.8.0-34.37
  ProcVersionSignature: Ubuntu 5.8.0-34.37-generic 5.8.18
  Uname: Linux 5.8.0-34-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  ApportVersion: 2.20.11-0ubuntu50.3
  Architecture: amd64
  CasperMD5CheckResult: skip
  CurrentDesktop: ubuntu:GNOME
  Date: Sat Jan  9 11:56:28 2021
  InstallationDate: Installed on 2020-08-15 (146 days ago)
  InstallationMedia: Ubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
  MachineType: Intel Corporation NUC8i7HVK
  ProcFB: 0 amdgpudrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.8.0-34-generic root=UUID=c212e9d4-a049-4da0-8e34-971cb7414e60 ro quiet splash vt.handoff=7
  RebootRequiredPkgs:
   linux-image-5.8.0-36-generic
   linux-base
  RelatedPackageVersions:
   linux-restricted-modules-5.8.0-34-generic N/A
   linux-backports-modules-5.8.0-34-generic  N/A
   linux-firmware                            1.190.2
  SourcePackage: linux
  UpgradeStatus: Upgraded to groovy on 2020-09-20 (110 days ago)
  dmi.bios.date: 12/17/2018
  dmi.bios.release: 5.6
  dmi.bios.vendor: Intel Corp.
  dmi.bios.version: HNKBLi70.86A.0053.2018.1217.1739
  dmi.board.name: NUC8i7HVB
  dmi.board.vendor: Intel Corporation
  dmi.board.version: J68196-502
  dmi.chassis.type: 3
  dmi.chassis.vendor: Intel Corporation
  dmi.chassis.version: 2.0
  dmi.modalias: dmi:bvnIntelCorp.:bvrHNKBLi70.86A.0053.2018.1217.1739:bd12/17/2018:br5.6:svnIntelCorporation:pnNUC8i7HVK:pvrJ71485-502:rvnIntelCorporation:rnNUC8i7HVB:rvrJ68196-502:cvnIntelCorporation:ct3:cvr2.0:
  dmi.product.family: Intel NUC
  dmi.product.name: NUC8i7HVK
  dmi.product.version: J71485-502
  dmi.sys.vendor: Intel Corporation
