On 8 Oct 2024 11:29 -0400, from d...@randomstring.org (Dan Ritter):
>> The disk has been running continuously for seven years now and I am
>> running out of space anyway, so I already ordered a replacement. But I
>> do not fully understand what is happening.
>
> The drive is dying, slowly. In this case it's starting with a
> bad patch on a platter.
That would be my take too. The LBA addresses reported in a different
post in this thread being as close together as they are would also
corroborate the bad-patch-on-a-platter theory.

>> | SMART Attributes Data Structure revision number: 16
>> | Vendor Specific SMART Attributes with Thresholds:
>> | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>> |   1 Raw_Read_Error_Rate     0x002f   199   169   051    Pre-fail  Always       -       81
>> |   3 Spin_Up_Time            0x0027   198   197   021    Pre-fail  Always       -       9100
>> |   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       83
>> |   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>> |   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>> |   9 Power_On_Hours          0x0032   016   016   000    Old_age   Always       -       61794
>> |  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>> |  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>> |  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       82
>> | 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       54
>> | 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2219
>> | 194 Temperature_Celsius     0x0022   119   116   000    Old_age   Always       -       33
>> | 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>> | 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>> | 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>> | 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>> | 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       43
>
> This looks like a drive which is old and starting to wear out
> but is not there yet. The raw read error rate is starting to
> creep up but isn't at a threshold.

I agree. The almost 62,000 hours works out to just over seven years of
run time (61794 / 24 / 365.25 ≈ 7.05 years), and based on the
start/stop count and power cycle count it's been running basically
continuously for that time (which is generally good for longevity, as
long as it's not subjected to excessive heat).

It's entirely possible that the mechanical components are degrading,
which in turn might also be interfering with the drive's ability to
reliably store and retrieve data. Yes, servo tracks and such things
are supposed to catch and compensate for that, but it might not be
quite that bad yet. Sometimes HDDs fail with a bang, and sometimes
they fail with a whimper.

Also note that some disks actually lie in their SMART data. I don't
know if yours does, but I would definitely question a raw value of 0
for failed sectors (Current_Pending_Sector and Offline_Uncorrectable)
_and_ for Reallocated_Sector_Ct on a disk that's reporting I/O errors,
for example. _At least_ one of those should be >0 for a truthful
storage device in that situation. (A quick way to cross-check those
counters is sketched at the end of this message.)

What I would not do at this point is subject the disk to any more
physical stress than is unavoidable. Unless you absolutely must, do
not physically unplug or remove that disk before the RAID array has
resilvered onto the new disk; how to check on that is also shown
below. It's currently providing value as a second source of truth
about what's stored; you don't want to remove it and then find during
the resilver that the other disk in the array has a problem.
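As a footnote to the point about SMART honesty, below is a minimal
sketch of the kind of cross-check I have in mind, assuming
smartmontools is installed and the script is run as root. /dev/sda is
a stand-in for the actual device; adjust as needed.

    #!/bin/sh
    # Sketch only: compare what the kernel has logged against what
    # SMART claims. DISK is a placeholder for the suspect drive.
    DISK=/dev/sda

    # Grab the attribute table once; column 10 of each row is RAW_VALUE.
    attrs=$(smartctl -A "$DISK")
    realloc=$(printf '%s\n' "$attrs" | awk '$1 == 5   { print $10 }')
    pending=$(printf '%s\n' "$attrs" | awk '$1 == 197 { print $10 }')
    uncorr=$(printf '%s\n' "$attrs" | awk '$1 == 198 { print $10 }')

    echo "Reallocated: $realloc, Pending: $pending, Uncorrectable: $uncorr"

    # I/O errors in the kernel log while all three counters read zero
    # is exactly the "disk is lying" pattern described above.
    if dmesg | grep -q "I/O error.*$(basename "$DISK")" &&
       [ "$realloc" = 0 ] && [ "$pending" = 0 ] && [ "$uncorr" = 0 ]
    then
        echo "Suspicious: kernel reports I/O errors, SMART reports no bad sectors."
    fi

It's only a heuristic, of course, but it's cheap to run; and note that
a flaky cable would normally also push up UDMA_CRC_Error_Count, which
is still 0 here.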
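And on not pulling the old disk too early: make sure the rebuild has
actually completed before you touch the hardware. How to check depends
on the RAID implementation; for the common Linux cases (the pool name
"tank" is just an example):

    # Linux md RAID: wait until no recovery/resync progress line is
    # shown and all members are up, e.g. [UU] rather than [U_].
    cat /proc/mdstat

    # ZFS: wait for "scan: resilvered ... with 0 errors" rather than
    # "scan: resilver in progress".
    zpool status tank

-- 
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”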