Back in October I set up a software RAID 5 array using md. I used five 300 GB
SATA-II drives, running on two Promise TX4 SATAII controllers (the
new ones with NCQ). One controller is connected to two of the drives, and the
other to three.
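For reference, a five-disk RAID 5 like this is typically created with something
along these lines (the device names and options are illustrative, not a
transcript of what I actually ran back then):
# one RAID partition per drive, five members total
mdadm --create /dev/md0 --level=5 --raid-devices=5 \
  /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
# and the filesystem on top
mkfs.xfs /dev/md0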
A few days ago, after moving to a new house, I set up the server
containing the array and tried to connect to it. I couldn't reach the
server through the intranet, so I hooked up a keyboard and monitor to
see what was up. When I peered in, I saw that the kernel hadn't even
finished its boot procedure. Right as md was loaded by the kernel (it is
built in, not a module), there was a call stack and a kernel error "IRQ
193: nobody cared!" or something similar. Following that were repeated
messages about SCSI commands failing on, I believe, three of my drives.
Rebooting the machine didn't make the behavior go away, so I powered it off
and reseated all of the SATA connectors. This time the boot got further.
Here is what syslog said when it autodetected the MD array:
Sep 11 23:46:57 localhost kernel: md: Autodetecting RAID arrays.
Sep 11 23:46:57 localhost kernel: md: autorun ...
Sep 11 23:46:57 localhost kernel: md: considering sdf1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdf1 ...
Sep 11 23:46:57 localhost kernel: md: adding sde1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdd1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdc1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdb1 ...
Sep 11 23:46:57 localhost kernel: md: created md0
Sep 11 23:46:57 localhost kernel: md: bind<sdb1>
Sep 11 23:46:57 localhost kernel: md: bind<sdc1>
Sep 11 23:46:57 localhost kernel: md: bind<sdd1>
Sep 11 23:46:57 localhost kernel: md: bind<sde1>
Sep 11 23:46:57 localhost kernel: md: bind<sdf1>
Sep 11 23:46:57 localhost kernel: md: running: <sdf1><sde1><sdd1><sdc1><sdb1>
Sep 11 23:46:57 localhost kernel: md: kicking non-fresh sdc1 from array!
Sep 11 23:46:57 localhost kernel: md: unbind<sdc1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdc1)
Sep 11 23:46:57 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 11 23:46:57 localhost kernel: raid5: device sdf1 operational as raid disk 4
Sep 11 23:46:57 localhost kernel: raid5: device sde1 operational as raid disk 3
Sep 11 23:46:57 localhost kernel: raid5: device sdd1 operational as raid disk 2
Sep 11 23:46:57 localhost kernel: raid5: device sdb1 operational as raid disk 0
Sep 11 23:46:57 localhost kernel: raid5: cannot start dirty degraded array for md0
Sep 11 23:46:57 localhost kernel: RAID5 conf printout:
Sep 11 23:46:57 localhost kernel: --- rd:5 wd:4 fd:1
Sep 11 23:46:57 localhost kernel: disk 0, o:1, dev:sdb1
Sep 11 23:46:57 localhost kernel: disk 2, o:1, dev:sdd1
Sep 11 23:46:57 localhost kernel: disk 3, o:1, dev:sde1
Sep 11 23:46:57 localhost kernel: disk 4, o:1, dev:sdf1
Sep 11 23:46:57 localhost kernel: raid5: failed to run raid set md0
Sep 11 23:46:57 localhost kernel: md: pers->run() failed ...
Sep 11 23:46:57 localhost kernel: md: do_md_run() returned -22
Sep 11 23:46:57 localhost kernel: md: md0 stopped.
Sep 11 23:46:57 localhost kernel: md: unbind<sdf1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdf1)
Sep 11 23:46:57 localhost kernel: md: unbind<sde1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sde1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdd1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdd1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdb1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdb1)
Sep 11 23:46:57 localhost kernel: md: ... autorun DONE.
Note the message about sdc being kicked as non-fresh. Also note that the
array is both DIRTY and DEGRADED: degraded (I'm guessing) because sdc was
kicked out, and dirty because the machine was powered off while it was
erroring, so the array never got a clean shutdown.
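For anyone who wants to see the same state on their end, the relevant
information comes from /proc/mdstat and the per-device RAID superblocks;
roughly:
# array status as the kernel currently sees it
cat /proc/mdstat
# dump each member's RAID superblock; a member whose Events counter
# lags behind the others is what md kicks out as "non-fresh"
mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1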
I played around with mdadm but I could never get the array to start, even
though all of the superblocks were intact, including sdc's. Finally, I ran
"mdrun", which managed to start the array. Here is the kernel logging from
that command:
Sep 12 01:04:37 localhost kernel: md: md0 stopped.
Sep 12 01:04:37 localhost kernel: md: bind<sdc>
Sep 12 01:04:37 localhost kernel: md: bind<sdd>
Sep 12 01:04:37 localhost kernel: md: bind<sdf>
Sep 12 01:04:37 localhost kernel: md: bind<sde>
Sep 12 01:04:37 localhost kernel: md: bind<sdb>
Sep 12 01:04:37 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 12 01:04:37 localhost kernel: raid5: device sdb operational as raid disk 0
Sep 12 01:04:37 localhost kernel: raid5: device sde operational as raid disk 4
Sep 12 01:04:37 localhost kernel: raid5: device sdf operational as raid disk 3
Sep 12 01:04:37 localhost kernel: raid5: device sdd operational as raid disk 2
Sep 12 01:04:37 localhost kernel: raid5: device sdc operational as raid disk 1
Sep 12 01:04:37 localhost kernel: raid5: allocated 5248kB for md0
Sep 12 01:04:37 localhost kernel: raid5: raid level 5 set md0 active with 5 out of 5 devices, algorithm 2
Sep 12 01:04:37 localhost kernel: RAID5 conf printout:
Sep 12 01:04:37 localhost kernel: --- rd:5 wd:5 fd:0
Sep 12 01:04:37 localhost kernel: disk 0, o:1, dev:sdb
Sep 12 01:04:37 localhost kernel: disk 1, o:1, dev:sdc
Sep 12 01:04:37 localhost kernel: disk 2, o:1, dev:sdd
Sep 12 01:04:37 localhost kernel: disk 3, o:1, dev:sdf
Sep 12 01:04:37 localhost kernel: disk 4, o:1, dev:sde
Sep 12 01:04:37 localhost kernel: .<6>md: syncing RAID array md0
Sep 12 01:04:37 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Sep 12 01:04:37 localhost kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Sep 12 01:04:37 localhost kernel: md: using 128k window, over a total of 293057280 blocks.
Sep 12 01:04:37 localhost kernel: md: md1 stopped.
Sep 12 01:04:37 localhost last message repeated 4 times
So it seems to me that mdrun forced the array to start, and since it began a
"sync" rather than a "reconstruct", it must have treated sdc as not failed
and used all five drives to rebuild the parity information (as I understand
it, a sync rebuilds parity, while a reconstruct rebuilds a failed drive).
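For what it's worth, I believe the mdadm equivalent of what mdrun did is a
forced assemble, something like the sketch below (the --force flag and the
whole-disk device names are my guesses based on the log above, not something
I've confirmed against mdrun's internals):
# stop the half-assembled array left over from the failed autorun
mdadm --stop /dev/md0
# force assembly from all five members, overriding the stale
# event count that got sdc kicked out as non-fresh
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# watch the resync progress
cat /proc/mdstat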
During the resync, and even after it finished, I could not access the XFS
filesystem. Neither xfs_repair nor xfs_check could find a valid XFS
superblock. I let xfs_repair scan the entire device and it could not find a
single XFS superblock anywhere. However, piping /dev/md0 through strings
does yield some filenames that I recognize from the filesystem.
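For reference, the checks were along these lines (the exact flags are
approximations rather than a transcript; as far as I know none of these
write to the device):
# no-modify check; with a bad primary superblock xfs_repair will
# scan the whole device looking for secondary superblocks
xfs_repair -n /dev/md0
xfs_check /dev/md0
# this is what does turn up recognizable filenames
strings /dev/md0 | less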
So now I've got this array running, but I still don't know what malfunctioned
in the first place. On top of that, I have a broken filesystem that I don't
want to give up on, because I'd be losing a ton of data. Does anyone have any
suggestions?
-Adar
PS: I'm not subscribed to debian-user, so please include me in the replies.