We have an Ubuntu server running eight Samsung 980 Pro PCIe 4.0 NVMe SSDs (model MZ-V8P1T0BW) on Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic x86_64). We've seen this failure at least 5 times over the past month, and not always on the same SSD. We first saw it on 5.4.0-81. Some samples from dmesg are below.

This is a production system that runs a set of virtual desktop instances. Thankfully we use these in a zfs pool with four pairs of RAID 1 vdevs, so the only outage we've had so far is when it hit both members of a mirrored pair. After a reboot the SSDs come back up.

[Mon Sep 6 12:58:36 2021] nvme nvme5: I/O 132 QID 46 timeout, aborting
[Mon Sep 6 12:58:37 2021] nvme nvme5: I/O 133 QID 46 timeout, aborting
[Mon Sep 6 12:58:39 2021] nvme nvme5: I/O 134 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 135 QID 46 timeout, aborting
[Mon Sep 6 12:58:40 2021] nvme nvme5: I/O 784 QID 48 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 136 QID 46 timeout, aborting
[Mon Sep 6 12:58:41 2021] nvme nvme5: I/O 137 QID 46 timeout, aborting
[Mon Sep 6 12:58:42 2021] nvme nvme5: I/O 492 QID 28 timeout, aborting
[Mon Sep 6 12:59:07 2021] nvme nvme5: I/O 132 QID 46 timeout, reset controller
[Mon Sep 6 12:59:38 2021] nvme nvme5: I/O 24 QID 0 timeout, reset controller
[Mon Sep 6 13:00:29 2021] nvme nvme5: Device not ready; aborting reset
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:29 2021] nvme nvme5: Abort status: 0x371
[Mon Sep 6 13:00:33 2021] INFO: task txg_quiesce:2172 blocked for more than 120 seconds.
[Mon Sep 6 13:00:33 2021] Tainted: P OE 5.4.0-81-generic #91-Ubuntu

[Tue Sep 21 21:18:36 2021] nvme nvme2: I/O 175 QID 38 timeout, aborting
[Tue Sep 21 21:18:37 2021] nvme nvme2: I/O 240 QID 26 timeout, aborting
[Tue Sep 21 21:18:47 2021] nvme nvme2: I/O 718 QID 23 timeout, aborting
[Tue Sep 21 21:18:56 2021] nvme nvme2: I/O 719 QID 23 timeout, aborting
[Tue Sep 21 21:19:06 2021] nvme nvme2: I/O 175 QID 38 timeout, reset controller
[Tue Sep 21 21:19:37 2021] nvme nvme2: I/O 17 QID 0 timeout, reset controller
[Tue Sep 21 21:20:27 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:27 2021] nvme nvme2: Abort status: 0x371
[Tue Sep 21 21:20:47 2021] nvme nvme2: Device not ready; aborting reset
[Tue Sep 21 21:20:47 2021] nvme nvme2: Removing after probe failure status: -19
[Tue Sep 21 21:21:08 2021] nvme nvme2: Device not ready; aborting reset

[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 1013 QID 38 timeout, aborting
[Tue Oct 5 16:54:59 2021] nvme nvme6: I/O 727 QID 39 timeout, aborting
[Tue Oct 5 16:55:03 2021] nvme nvme6: I/O 1014 QID 38 timeout, aborting
[Tue Oct 5 16:55:05 2021] nvme nvme6: I/O 1015 QID 38 timeout, aborting
[Tue Oct 5 16:55:25 2021] nvme nvme6: I/O 15 QID 21 timeout, aborting
[Tue Oct 5 16:55:25 2021] nvme nvme6: I/O 408 QID 37 timeout, aborting
[Tue Oct 5 16:55:29 2021] nvme nvme6: I/O 1013 QID 38 timeout, reset controller
[Tue Oct 5 16:55:59 2021] nvme nvme6: I/O 11 QID 0 timeout, reset controller
[Tue Oct 5 16:56:51 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:56:51 2021] nvme nvme6: Abort status: 0x371
[Tue Oct 5 16:57:11 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:57:11 2021] nvme nvme6: Removing after probe failure status: -19
[Tue Oct 5 16:57:32 2021] nvme nvme6: Device not ready; aborting reset
[Tue Oct 5 16:57:32 2021] blk_update_request: I/O error, dev nvme6n1, sector 842198232 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0

[Mon Oct 11 12:14:38 2021] nvme nvme2: I/O 306 QID 48 timeout, aborting
[Mon Oct 11 12:14:39 2021] nvme nvme2: I/O 827 QID 14 timeout, aborting
[Mon Oct 11 12:15:01 2021] nvme nvme2: I/O 828 QID 14 timeout, aborting
[Mon Oct 11 12:15:05 2021] nvme nvme2: I/O 829 QID 14 timeout, aborting
[Mon Oct 11 12:15:07 2021] nvme nvme2: I/O 830 QID 14 timeout, aborting
[Mon Oct 11 12:15:08 2021] nvme nvme2: I/O 306 QID 48 timeout, reset controller
[Mon Oct 11 12:15:38 2021] nvme nvme2: I/O 20 QID 0 timeout, reset controller
[Mon Oct 11 12:16:29 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:29 2021] nvme nvme2: Abort status: 0x371
[Mon Oct 11 12:16:50 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:16:50 2021] nvme nvme2: Removing after probe failure status: -19
[Mon Oct 11 12:17:10 2021] nvme nvme2: Device not ready; aborting reset
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 1159355592 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct 11 12:17:10 2021] blk_update_request: I/O error, dev nvme2n1, sector 992254136 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 0

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1910866

Title:
  nvme drive fails after some time

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Groovy:
  Fix Released
Status in Debian:
  New

Bug description:
  Sorry for the vague title. I thought this was a hardware issue until someone else online mentioned their nvme drive goes "read only" after some time. I tend not to reboot my system much, so have a large journal. Either way this happens once in a while. The / drive is fine, but /home is on nvme which just disappears. I reboot and everything is fine. But leave it long enough and it'll fail again.

  Here's the most recent snippet about the nvme drive before I restarted the system.

  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 449 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 450 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 451 QID 5 timeout, aborting
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, reset controller
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 22 QID 0 timeout, reset controller
  Jan 08 19:21:04 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:25 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:25 robot kernel: nvme nvme1: Removing after probe failure status: -19
  Jan 08 19:21:41 robot kernel: INFO: task jbd2/nvme1n1p1-:731 blocked for more than 120 seconds.
  Jan 08 19:21:41 robot kernel: jbd2/nvme1n1p1- D 0 731 2 0x00004000
  Jan 08 19:21:45 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993784 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123967, lost async page write
  Jan 08 19:21:45 robot kernel: EXT4-fs error (device nvme1n1p1): __ext4_find_entry:1535: inode #57278595: comm gsd-print-notif: reading directory lblock 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993384 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123917, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993320 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1833166472 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123909, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1909398624 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
  Jan 08 19:21:45 robot kernel: EXT4-fs (nvme1n1p1): I/O error while writing superblock

  ProblemType: Bug
  DistroRelease: Ubuntu 20.10
  Package: linux-image-5.8.0-34-generic 5.8.0-34.37
  ProcVersionSignature: Ubuntu 5.8.0-34.37-generic 5.8.18
  Uname: Linux 5.8.0-34-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  ApportVersion: 2.20.11-0ubuntu50.3
  Architecture: amd64
  CasperMD5CheckResult: skip
  CurrentDesktop: ubuntu:GNOME
  Date: Sat Jan 9 11:56:28 2021
  InstallationDate: Installed on 2020-08-15 (146 days ago)
  InstallationMedia: Ubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
  MachineType: Intel Corporation NUC8i7HVK
  ProcFB: 0 amdgpudrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.8.0-34-generic root=UUID=c212e9d4-a049-4da0-8e34-971cb7414e60 ro quiet splash vt.handoff=7
  RebootRequiredPkgs:
   linux-image-5.8.0-36-generic
   linux-base
  RelatedPackageVersions:
   linux-restricted-modules-5.8.0-34-generic N/A
   linux-backports-modules-5.8.0-34-generic N/A
   linux-firmware 1.190.2
  SourcePackage: linux
  UpgradeStatus: Upgraded to groovy on 2020-09-20 (110 days ago)
  dmi.bios.date: 12/17/2018
  dmi.bios.release: 5.6
  dmi.bios.vendor: Intel Corp.
  dmi.bios.version: HNKBLi70.86A.0053.2018.1217.1739
  dmi.board.name: NUC8i7HVB
  dmi.board.vendor: Intel Corporation
  dmi.board.version: J68196-502
  dmi.chassis.type: 3
  dmi.chassis.vendor: Intel Corporation
  dmi.chassis.version: 2.0
  dmi.modalias: dmi:bvnIntelCorp.:bvrHNKBLi70.86A.0053.2018.1217.1739:bd12/17/2018:br5.6:svnIntelCorporation:pnNUC8i7HVK:pvrJ71485-502:rvnIntelCorporation:rnNUC8i7HVB:rvrJ68196-502:cvnIntelCorporation:ct3:cvr2.0:
  dmi.product.family: Intel NUC
  dmi.product.name: NUC8i7HVK
  dmi.product.version: J71485-502
  dmi.sys.vendor: Intel Corporation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1910866/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
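
Both sets of logs in this report show the same sequence per controller: I/O timeouts with aborts, controller resets, "Device not ready; aborting reset", and finally "Removing after probe failure status: -19". As a rough aid for checking which controllers are affected across incidents, here is a minimal sketch in Python that tallies those messages per controller. The function name `summarize` and the event labels are illustrative assumptions, not part of any existing tool; the patterns are copied from the log lines quoted in this report and may not match messages from other kernel versions.

```python
import re
from collections import Counter, defaultdict

# Patterns copied from the kernel messages quoted in this report.
# The event labels on the left are made up for this sketch.
EVENTS = {
    "timeout_abort": re.compile(
        r"nvme (nvme\d+): I/O \d+ QID \d+ timeout, aborting"),
    "timeout_reset": re.compile(
        r"nvme (nvme\d+): I/O \d+ QID \d+ timeout, reset controller"),
    "reset_failed": re.compile(
        r"nvme (nvme\d+): Device not ready; aborting reset"),
    "removed": re.compile(
        r"nvme (nvme\d+): Removing after probe failure status: -?\d+"),
}


def summarize(lines):
    """Return {controller: Counter(event label -> count)} for log lines."""
    summary = defaultdict(Counter)
    for line in lines:
        for label, pattern in EVENTS.items():
            match = pattern.search(line)
            if match:
                # group(1) is the controller name, e.g. "nvme2"
                summary[match.group(1)][label] += 1
                break  # each line matches at most one event type
    return dict(summary)
```

For example, passing the lines of a saved `dmesg -T` capture (or a journal excerpt) to `summarize` yields one counter per controller, which makes it easier to see whether removals cluster on particular drives or are spread across all of them.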