Hi,

On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
> The way I understand these messages is that some sectors cannot be
> read from sdb at all and the disk is unable to reallocate the data
> somewhere else (probably because it doesn't know what the data should
> be in the first place).

When MD receives a read error it reads the mirrored data and writes it
back. If it can't do that, it fails the disk, so you are not getting
there yet.

> Two of these message blocks end with this:
>
> | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
>
> What does that mean for the other instances of this error?

I expect you probably have either no TLER value set, or it's set higher
than the kernel's own timeout. By default, consumer drives try very
hard to read data, taking a long time to do so when there are issues.
The kernel SCSI layer will retry several times, so the drive's timeout
is multiplied. Only if this ends up exceeding 30s will you get a read
error, and the message from MD about rescheduling the sector.

> The data is still readable from the other disk in the RAID, right?
> Why doesn't md mention it?

I suspect that the times you saw an error from the SCSI layer but not
from MD were times that the SCSI layer retried and got the data out
eventually. When the SCSI layer exhausts all of its retries it actually
resets the drive and then the whole bus, and that often causes MD to
drop the disk. You haven't mentioned any messages about resetting the
bus, so I think you are not having that many retries. The fact that you
are having any is bad, though.

> Why is the RAID still considered healthy? At some point I would
> expect the disk to be kicked from the RAID.

This will happen when/if MD can't compensate by reading data from other
mirrors and writing it back. If a write fails, or a disk drops out
entirely, then MD will fail the device.

Hopefully the results of your SMART long self-test will help clear this
up. These things can be hard to track down though.

After you do resolve this you should set TLER to some sensible value
like 7 seconds. That is not your biggest concern right now, though.
Here is a thing I wrote about it quite some time ago:

https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts
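
In brief, checking and setting it looks something like this (a sketch;
adjust /dev/sdb to suit, and note that smartctl takes the ERC values in
tenths of a second, so 70 means 7 seconds):

  # Query the drive's current error recovery (TLER/ERC) setting
  smartctl -l scterc /dev/sdb

  # Set both read and write error recovery to 7 seconds
  smartctl -l scterc,70,70 /dev/sdb

  # The kernel's own per-device command timeout, in seconds
  cat /sys/block/sdb/device/timeout

Bear in mind that many drives forget the ERC setting across a power
cycle, so you'll want something that reapplies it at boot, e.g. a udev
rule or an init script.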
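
As for the self-test, once it's finished you can see the results, and
the attributes I mention below (reallocated / pending sectors), with
something like:

  # Results of past and in-progress SMART self-tests
  smartctl -l selftest /dev/sdb

  # Full SMART output including the attribute table
  smartctl -a /dev/sdb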

> Do you think I should remove the drive from the RAID immediately? Or
> should I suspect something else is at fault?

The fact that you have no reallocated sectors and no pending sectors,
and apparently all your writes are working, makes me think there
probably isn't a fault with the drive, but in some ways that is worse,
as it's easy to replace a drive and not so easy to diagnose bad cables,
marginal power supplies, etc.

I probably wouldn't remove it because it's better than nothing. I
probably would try the easy fix of replacing the drive first, if I
could afford that.

> I prefer not to run the risk of losing the RAID completely when I
> keep on running on one disk while the new one is being shipped.

I would make sure the timeouts are set correctly, because if you do get
into the situation where the kernel is resetting the bus, that can
temporarily take away both drives at once, which can cause MD to fail
both out and mark the array as faulty. It's relatively easy to do the
manual intervention required to start it up again, but it is stressful.

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting