Re: I/O errors during RAID check but no SMART errors

Jochen Spieker Tue, 08 Oct 2024 13:29:19 -0700

Andy Smith:
> On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
>> The way I understand these messages is that some sectors cannot be read
>> from sdb at all and the disk is unable to reallocate the data somewhere
>> else (probably because it doesn't know what the data should be in the
>> first place).
> 
> When MD receives a read error it does read the mirrored data and write
> it back. If it can't do that it fails the disk, so you are not getting
> there yet.


Okay, that's good, I guess.

>> Two of these message blocks end with this:
>> 
>>| Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
>> 
>> What does that mean for the other instances of this error?
> 
> I expect you probably have either no TLER value set

Thanks a lot, I had never heard of that before. But by chance my WD REDs
actually seem to come with a default of 7 seconds:

| /dev/sdb:
| smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-25-amd64] (local build)
| Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
| 
| SCT Error Recovery Control:
|            Read:     70 (7.0 seconds)
|           Write:     70 (7.0 seconds)


> or it's set higher
> than the kernel's own timeout. By default consumer drives try very hard
> to read data, taking a long time doing so when there are issues. The
> kernel SCSI layer will try several times, so the drive's timeout is
> multiplied. Only if this ends up exceeding 30s will you get a read
> error, and the message from MD about rescheduling the sector.

That makes sense, and it might also explain why the disk does not report
any reallocated sectors (yet).
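
To make the comparison concrete, here is a rough sketch that reads the
kernel's per-command timeout from sysfs and checks that the drive's
error recovery limit is below it. The function names are made up for
illustration, and the check is deliberately simplistic: it does not
model the retries mentioned above, it only compares the two raw values.

```python
from pathlib import Path


def kernel_timeout_seconds(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read the kernel's command timeout (in seconds) for a disk such as 'sdb'."""
    path = Path(sysfs_root) / device / "device" / "timeout"
    return int(path.read_text().strip())


def erc_below_kernel_timeout(erc_seconds: float, kernel_seconds: float) -> bool:
    """True if the drive gives up on a bad sector before the kernel times out.

    Simplification (my assumption): retries by the SCSI layer are ignored.
    """
    return erc_seconds < kernel_seconds


if __name__ == "__main__":
    timeout = kernel_timeout_seconds("sdb")  # device name taken from this thread
    print("kernel timeout:", timeout, "s")
    print("7.0 s ERC below kernel timeout:", erc_below_kernel_timeout(7.0, timeout))
```

With the 7 seconds reported by my drives and the 30 s mentioned above,
the check passes, so the timeouts alone would not explain the errors.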

> Hopefully the results of your SMART long self-test will help clear this
> up. These things can be hard to track down though.

10% remaining … "long" is really long.

> After you do resolve this you should set TLER to some sensible value
> like 7 seconds. That is not your biggest concern right now though.
> 
> Here is a thing I wrote about it quite some time ago:
> 
>     https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts

Thanks a lot again.

>> Do you think I should remove the drive from the RAID immediately? Or
>> should I suspect something else is at fault?
> 
> The fact that you have no reallocated sectors and no pending sectors
> and apparently all your writes are working makes me think there probably
> isn't a fault with the drive but in some ways that is worse as it's easy
> to replace a drive, not so easy to diagnose bad cables and marginal power
> supplies etc.

See my other reply, the sector numbers do not appear to be random, so I
hope that it is actually the disk.

>> I prefer not to run the risk of losing the RAID completely when I keep
>> on running on one disk while the new one is being shipped.
> 
> I would make sure the timeouts are set correctly because if you do get
> into the situation where the kernel is resetting the bus, that can
> temporarily take away both drives at once which can cause MD to fail
> both out and mark the array as faulty. It's relatively easy to do the
> manual intervention required to start it up again but it is stressful.

I guess if that really happens I will strongly consider just restoring
from backup. I just need to think hard about the things that I have
excluded from backup deliberately. ^^ But the new disk is expected to be
delivered tomorrow, so I keep my fingers crossed. I mean, that is why I
am using RAID1 in the first place.

J.
-- 
I use a Playstation to block out the existence of my partner.
[Agree]   [Disagree]
                 <http://archive.slowlydownward.com/NODATA/data_enter2.html>
