Andy Smith:
> On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
>> The way I understand these messages is that some sectors cannot be read
>> from sdb at all and the disk is unable to reallocate the data somewhere
>> else (probably because it doesn't know what the data should be in the
>> first place).
>
> When MD receives a read error it does read the mirrored data and write
> it back. If it can't do that it fails the disk, so you are not getting
> there yet.

Okay, that's good, I guess.

>> Two of these message blocks end with this:
>>
>>| Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
>>
>> What does that mean for the other instances of this error?
>
> I expect you probably have either no TLER value set

Thanks a lot, I had never heard of that before. But by chance my WD REDs
actually seem to come with a default of 7 seconds:

| /dev/sdb:
| smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-25-amd64] (local build)
| Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
|
| SCT Error Recovery Control:
|            Read:     70 (7.0 seconds)
|           Write:     70 (7.0 seconds)

> or it's set higher
> than the kernel's own timeout. By default consumer drives try very hard
> to read data, taking a long time doing so when there are issues. The
> kernel SCSI layer will try several times, so the drive's timeout is
> multiplied. Only if this ends up exceeding 30s will you get a read
> error, and the message from MD about rescheduling the sector.

That makes sense. And might also explain why the disk does not report
any reallocated sectors (yet).

> Hopefully the results of your SMART long self-test will help clear this
> up. These things can be hard to track down though.

10% remaining … "long" is really long.

> After you do resolve this you should set TLER to some sensible value
> like 7 seconds. That is not your biggest concern right now though.
>
> Here is a thing I wrote about it quite some time ago:
>
> https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts

Thanks a lot again.

>> Do you think I should remove the drive from the RAID immediately?
>> Or should I suspect something else is at fault?
>
> The fact that you have no reallocated sectors and no pending sectors
> and apparently all your writes are working makes me think there probably
> isn't a fault with the drive but in some ways that is worse as it's easy
> to replace a drive, not so easy to diagnose bad cables and marginal power
> supplies etc etc.

See my other reply, the sector numbers do not appear to be random, so I
hope that it is actually the disk.

>> I prefer not to run the risk of losing the RAID completely when I keep
>> on running on one disk while the new one is being shipped.
>
> I would make sure the timeouts are set correctly because if you do get
> into the situation where the kernel is resetting the bus, that can
> temporarily take away both drives at once which can cause MD to fail
> both out and mark the array as faulty. It's relatively easy to do the
> manual intervention required to start it up again but it is stressful.

I guess if that really happens I will strongly consider just restoring
from backup. I just need to think hard about the things that I have
excluded from backup deliberately. ^^

But the new disk is expected to be delivered tomorrow, so I keep my
fingers crossed. I mean, that is why I am using RAID1 in the first
place.

J.
-- 
I use a Playstation to block out the existence of my partner.
[Agree]   [Disagree]
<http://archive.slowlydownward.com/NODATA/data_enter2.html>
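
For reference, the timeout comparison discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (parse_scterc, kernel_timeout, erc_is_safe) are made up for this example, the sample text is the SCT ERC report quoted above, and the 30 second default is the kernel timeout mentioned in the thread. The real output would come from something like `smartctl -l scterc /dev/sdb`, and the per-device kernel timeout is normally readable from /sys/block/<dev>/device/timeout.

```python
import re
from pathlib import Path

# Sample taken from the smartctl report quoted in this message.
SAMPLE = """\
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""


def parse_scterc(text):
    """Return {'read': seconds, 'write': seconds} from SCT ERC report text.

    smartctl prints the values in units of 100 ms. A setting shown as
    "Disabled" is reported as None (hypothetical convention of this sketch).
    """
    result = {}
    for kind in ("Read", "Write"):
        m = re.search(rf"^\s*{kind}:\s+(\d+|Disabled)", text, re.MULTILINE)
        if m is None:
            continue
        value = m.group(1)
        result[kind.lower()] = None if value == "Disabled" else int(value) / 10
    return result


def kernel_timeout(dev):
    """Read the kernel's command timeout in seconds for e.g. dev='sdb'."""
    return float(Path(f"/sys/block/{dev}/device/timeout").read_text().strip())


def erc_is_safe(erc, timeout_seconds=30.0):
    """True only if both ERC values are set and below the kernel timeout."""
    values = [erc.get("read"), erc.get("write")]
    return all(v is not None and v < timeout_seconds for v in values)


erc = parse_scterc(SAMPLE)
print(erc, erc_is_safe(erc))
```

On a real system one would pass kernel_timeout("sdb") as the second argument to erc_is_safe instead of relying on the assumed default. The check is deliberately simple and does not try to model how many times the SCSI layer retries.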