Hi,

On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
> The way I understand these messages is that some sectors cannot be read
> from sdb at all and the disk is unable to reallocate the data somewhere
> else (probably because it doesn't know what the data should be in the
> first place).

When MD receives a read error it reads the mirrored data from the other
disk and writes it back over the bad sector. If it can't do that it
fails the disk, so you haven't reached that point yet.
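
If you're curious how often MD has had to do that, it keeps an
approximate per-member count of corrected read errors in sysfs.
Something like this, assuming the array is md0 and the member is sdb1
(adjust for your device names):

    # approximate count of read errors MD has corrected on this member
    cat /sys/block/md0/md/dev-sdb1/errors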

> Two of these message blocks end with this:
> 
> | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744
> 
> What does that mean for the other instances of this error?

I expect you probably have either no TLER value set or it's set higher
than the kernel's own timeout. By default consumer drives try very hard
to read data, taking a long time over it when there are issues. The
kernel SCSI layer will retry several times, so the drive's timeout is
multiplied. Only if this ends up exceeding 30s will you get a read
error, and the message from MD about rescheduling the sector.
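
You can check both values like this (assuming the drive is sdb):

    # the drive's own error recovery timeout (SCT ERC / TLER), in
    # tenths of a second; drives without support will say so here
    smartctl -l scterc /dev/sdb

    # the kernel SCSI layer's timeout for the device, in seconds
    # (default 30)
    cat /sys/block/sdb/device/timeout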

> The data is still readable from the other disk in the RAID, right? Why
> doesn't md mention it?

I suspect that the times you saw an error from the SCSI layer but not
from MD were times when the SCSI layer retried and eventually got the
data out.

When the SCSI layer runs out of retries it actually resets the drive
and then the whole bus, and that often causes MD to drop the disk.
You haven't mentioned any messages about resetting the bus, so I don't
think you are getting that many retries.
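
It might be worth grepping the kernel log to be sure, something like:

    # look for link/device resets and I/O errors in the kernel log
    journalctl -k | grep -Ei 'reset|i/o error'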

The fact that you are having any is bad, though.

> Why is the RAID still considered healthy? At some point I
> would expect the disk to be kicked from the RAID.

This will happen when/if MD can't compensate by reading data from other
mirrors and writing it back. If a write fails, or a disk drops
out entirely, then MD will fail the device.
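
You can watch for that with mdadm (assuming the array is md0):

    # array state, failed device count, and each member's status
    mdadm --detail /dev/md0

    # quick summary of all arrays
    cat /proc/mdstat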

Hopefully the results of your SMART long self-test will help clear this
up. These things can be hard to track down though.
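
For reference (assuming sdb), the test is started and its results read
with:

    # start the long self-test; it runs inside the drive and can take hours
    smartctl -t long /dev/sdb

    # read the self-test log once it has finished
    smartctl -l selftest /dev/sdb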

After you do resolve this you should set TLER to some sensible value
like 7 seconds. That is not your biggest concern right now though.

Here is a thing I wrote about it quite some time ago:

    https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts
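
The short version, assuming the drive is sdb, is:

    # tell the drive to give up on reads/writes after 7.0 seconds
    # (the values are in tenths of a second)
    smartctl -l scterc,70,70 /dev/sdb

    # if the drive doesn't support SCT ERC, raise the kernel's
    # timeout for it instead
    echo 180 > /sys/block/sdb/device/timeout

Neither setting survives a reboot, so it wants to go in a udev rule or
a boot script.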

> Do you think I should remove the drive from the RAID immediately? Or
> should I suspect something else is at fault?

The fact that you have no reallocated sectors and no pending sectors,
and apparently all your writes are working, makes me think there
probably isn't a fault with the drive. In some ways that is worse:
it's easy to replace a drive, not so easy to diagnose bad cables,
marginal power supplies and so on.
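
One thing SMART will show you if it is the cabling is the interface
CRC error count (usually attribute 199), so that's worth a look:

    # UDMA_CRC_Error_Count climbing over time points at a bad or
    # loose SATA cable rather than the drive itself
    smartctl -A /dev/sdb | grep -i crc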

I probably wouldn't remove it because it's better than nothing. I
probably would try the easy fix of replacing the drive first, if I could
afford that.

> I prefer not to run the risk of losing the RAID completely when I keep
> on running on one disk while the new one is being shipped.

I would make sure the timeouts are set correctly, because if you do
get into a situation where the kernel is resetting the bus, that can
temporarily take away both drives at once, which can cause MD to fail
both out and mark the array as faulty. It's relatively easy to do the
manual intervention required to start it up again, but it is stressful.
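
For the record, that intervention is usually along these lines
(assuming md0 built from sda1 and sdb1 — check your actual member
devices before running anything):

    # stop the failed array, then force reassembly from its members
    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1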

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
