Jochen Spieker wrote:
> I have two disks in a RAID-1:
>
> | $ cat /proc/mdstat
> | Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> | md0 : active raid1 sdb1[2] sdc1[0]
> |       5860390400 blocks super 1.2 [2/2] [UU]
> |       bitmap: 5/44 pages [20KB], 65536KB chunk
> |
> | unused devices: <none>
>
> During the latest monthly check I got kernel messages like this:
>
> | Oct 06 00:57:01 jigsaw kernel: md: data-check of RAID array md0
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: irq_stat 0x40000008
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: failed command: READ FPDMA QUEUED
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: cmd 60/80:d0:80:74:f9/08:00:2d:02:00/40 tag 26 ncq dma 1114112 in
> |                                         res 41/40:00:50:77:f9/00:00:2d:02:00/00 Emask 0x409 (media error) <F>
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: status: { DRDY ERR }
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: error: { UNC }
> | Oct 06 14:27:11 jigsaw kernel: ata3.00: configured for UDMA/133
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current]
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
> | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 CDB: Read(16) 88 00 00 00 00 02 2d f9 74 80 00 00 08 80 00 00
> | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
> | Oct 06 14:27:11 jigsaw kernel: ata3: EH complete
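(For context: that monthly "check" is md reading every sector of both
mirror halves and comparing them. On Debian it is normally started by
the mdadm cron job; assuming a stock setup, it amounts to roughly this
sketch:)

    # roughly what /usr/share/mdadm/checkarray does for md0
    echo check > /sys/block/md0/md/sync_action
    # watch progress while it runs
    cat /proc/mdstat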
If this happens once, it's just a thing that happened. If it happens
multiple times, it means that there's a hardware error: sometimes a
cable, rarely the SATA port, often the drive.

> The sector number mentioned at the bottom is increasing during the
> check.

So it repeats, and it's contiguous. That suggests a flaw in the drive
itself.

> The way I understand these messages is that some sectors cannot be read
> from sdb at all and the disk is unable to reallocate the data somewhere
> else (probably because it doesn't know what the data should be in the
> first place).

Yes.

> The disk has been running continuously for seven years now and I am
> running out of space anyway, so I already ordered a replacement. But I
> do not fully understand what is happening.

The drive is dying, slowly. In this case it's starting with a bad patch
on a platter.

> Two of these message blocks end with this:
>
> | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
>
> What does that mean for the other instances of this error? The data
> is still readable from the other disk in the RAID, right? Why doesn't md
> mention it? Why is the RAID still considered healthy? At some point I
> would expect the disk to be kicked from the RAID.

md will eventually do that, but not until it gets bad enough. That
could be quite noticeable.

> I unmounted the filesystem and performed a bad blocks scan (fsck.ext4
> -fcky) that did not find anything of importance (only "Inode x extent
> tree (at level 1) could be shorter/narrower"), and it also did not yield
> any of the above kernel messages. But another RAID check triggers these
> messages again, just with different sector numbers. The RAID is still
> healthy, though.

I don't think it is.

> Should this tell me that new sectors are dying all the time, or
> should this lead me to believe that a cable / the SATA controller is at
> fault? I don't even see any errors with smartctl:

If the sectors were effectively random, a cable fault would be likely.
If the sectors are contiguous or nearly so, that's definitely the disk.

> | SMART Attributes Data Structure revision number: 16
> | Vendor Specific SMART Attributes with Thresholds:
> | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> |   1 Raw_Read_Error_Rate     0x002f   199   169   051    Pre-fail  Always       -       81
> |   3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       9100
> |   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       83
> |   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> |   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> |   9 Power_On_Hours          0x0032   016   016   000    Old_age   Always       -       61794
> |  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> |  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> |  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       82
> | 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
> | 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2219
> | 194 Temperature_Celsius     0x0022   119   116   000    Old_age   Always       -       33
> | 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> | 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> | 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> | 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> | 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       43

This looks like a drive which is old and starting to wear out but is
not there yet. The raw read error rate is starting to creep up but
isn't at the failure threshold yet.
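(The UDMA_CRC_Error_Count of 0 also argues against a cable problem.)

A few things worth poking at in the meantime, as a sketch, assuming a
reasonably recent kernel; the sysfs paths below may not all exist on
yours:

    # how many read errors md has quietly fixed up on sdb1, and the
    # per-member error count at which md will finally kick a member out
    cat /sys/block/md0/md/dev-sdb1/errors
    cat /sys/block/md0/md/max_read_errors

    # read the region around one of the reported LBAs straight off sdb,
    # bypassing md; a repeatable UNC error at the same sector points at
    # the platter rather than the cable or controller
    dd if=/dev/sdb of=/dev/null bs=512 skip=9361257600 count=4096

    # a 'repair' pass makes md rewrite anything it cannot read from one
    # member using the copy on the other, which gives the drive a
    # chance to reallocate the bad sectors
    echo repair > /sys/block/md0/md/sync_action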
> I am still waiting for the result of a long self-test.
>
> Do you think I should remove the drive from the RAID immediately? Or
> should I suspect something else is at fault? I prefer not to run the
> risk of losing the RAID completely when I keep on running on one disk
> while the new one is being shipped. I do have backups, but it would be
> great if I didn't need to restore.

If the disk is only a few days away from being replaced, I would not
bother shutting it off, but I would treat the array as if it were no
longer a full mirror: if the good disk failed in the meantime, that
would be bad.

-dsr-
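P.S. When the new disk arrives, mdadm can do a hot replace so the array
never has to run on one disk: the new member gets a full copy before
the old one is dropped. Assuming the new disk shows up as sdd and is
partitioned the same way as the others (a sketch, adjust names):

    # add the new partition as a spare
    mdadm /dev/md0 --add /dev/sdd1
    # copy sdb1's data onto sdd1; sdb1 is only marked faulty once
    # sdd1 holds a complete copy
    mdadm /dev/md0 --replace /dev/sdb1 --with /dev/sdd1
    # afterwards the old member can be removed
    mdadm /dev/md0 --remove /dev/sdb1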