Back in October I set up a software RAID 5 array using md. I used five 300 GB
SATA-II drives, running on two Promise TX4 SATAII controllers (the
new ones with NCQ). One controller is connected to two of the drives, and the
other to three.
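For reference, a five-disk RAID 5 like this is typically created with something
along these lines (the device names and options are illustrative, not a
transcript of what I actually ran back then):
# one RAID partition per drive, five members total
mdadm --create /dev/md0 --level=5 --raid-devices=5 \
  /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
# and the filesystem on top
mkfs.xfs /dev/md0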
A few days ago, after moving to a new house, I set up the server
containing the array and tried to connect to it. I couldn't reach the
server through the intranet, so I hooked up a keyboard and monitor to
see what was up. When I peered in, I saw that the kernel hadn't even
finished its boot procedure. Right as md was loaded by the kernel (it is
built in, not a module), there was a call stack and a kernel error "IRQ
193: nobody cared!" or something similar. Following that were repeated
messages about SCSI commands failing on, I believe, three of my drives.
Rebooting the machine didn't make the behavior go away, so I powered it off
and reseated all of the SATA connectors. This time the boot got further.
Here is what syslog said when it autodetected the MD array:
Sep 11 23:46:57 localhost kernel: md: Autodetecting RAID arrays.
Sep 11 23:46:57 localhost kernel: md: autorun ...
Sep 11 23:46:57 localhost kernel: md: considering sdf1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdf1 ...
Sep 11 23:46:57 localhost kernel: md: adding sde1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdd1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdc1 ...
Sep 11 23:46:57 localhost kernel: md: adding sdb1 ...
Sep 11 23:46:57 localhost kernel: md: created md0
Sep 11 23:46:57 localhost kernel: md: bind<sdb1>
Sep 11 23:46:57 localhost kernel: md: bind<sdc1>
Sep 11 23:46:57 localhost kernel: md: bind<sdd1>
Sep 11 23:46:57 localhost kernel: md: bind<sde1>
Sep 11 23:46:57 localhost kernel: md: bind<sdf1>
Sep 11 23:46:57 localhost kernel: md: running: <sdf1><sde1><sdd1><sdc1><sdb1>
Sep 11 23:46:57 localhost kernel: md: kicking non-fresh sdc1 from array!
Sep 11 23:46:57 localhost kernel: md: unbind<sdc1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdc1)
Sep 11 23:46:57 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 11 23:46:57 localhost kernel: raid5: device sdf1 operational as raid disk 4
Sep 11 23:46:57 localhost kernel: raid5: device sde1 operational as raid disk 3
Sep 11 23:46:57 localhost kernel: raid5: device sdd1 operational as raid disk 2
Sep 11 23:46:57 localhost kernel: raid5: device sdb1 operational as raid disk 0
Sep 11 23:46:57 localhost kernel: raid5: cannot start dirty degraded array for md0
Sep 11 23:46:57 localhost kernel: RAID5 conf printout:
Sep 11 23:46:57 localhost kernel: --- rd:5 wd:4 fd:1
Sep 11 23:46:57 localhost kernel: disk 0, o:1, dev:sdb1
Sep 11 23:46:57 localhost kernel: disk 2, o:1, dev:sdd1
Sep 11 23:46:57 localhost kernel: disk 3, o:1, dev:sde1
Sep 11 23:46:57 localhost kernel: disk 4, o:1, dev:sdf1
Sep 11 23:46:57 localhost kernel: raid5: failed to run raid set md0
Sep 11 23:46:57 localhost kernel: md: pers->run() failed ...
Sep 11 23:46:57 localhost kernel: md: do_md_run() returned -22
Sep 11 23:46:57 localhost kernel: md: md0 stopped.
Sep 11 23:46:57 localhost kernel: md: unbind<sdf1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdf1)
Sep 11 23:46:57 localhost kernel: md: unbind<sde1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sde1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdd1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdd1)
Sep 11 23:46:57 localhost kernel: md: unbind<sdb1>
Sep 11 23:46:57 localhost kernel: md: export_rdev(sdb1)
Sep 11 23:46:57 localhost kernel: md: ... autorun DONE.
Note the message about sdc being kicked as non-fresh. Also note that the
array is both DIRTY and DEGRADED: degraded (I'm guessing) because sdc was
kicked out, and dirty because the machine was powered off while it was
erroring, so the array never got a clean shutdown.
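For anyone who wants to see the same state on their end, the relevant
information comes from /proc/mdstat and the per-device RAID superblocks;
roughly:
# array status as the kernel currently sees it
cat /proc/mdstat
# dump each member's RAID superblock; a member whose Events counter
# lags behind the others is what md kicks out as "non-fresh"
mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1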
I played around with mdadm but I could never get the array to start, even
though all of the superblocks were intact, including sdc's. Finally, I ran
"mdrun", which managed to start the array. Here is the kernel logging from
that command:
Sep 12 01:04:37 localhost kernel: md: md0 stopped.
Sep 12 01:04:37 localhost kernel: md: bind<sdc>
Sep 12 01:04:37 localhost kernel: md: bind<sdd>
Sep 12 01:04:37 localhost kernel: md: bind<sdf>
Sep 12 01:04:37 localhost kernel: md: bind<sde>
Sep 12 01:04:37 localhost kernel: md: bind<sdb>
Sep 12 01:04:37 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
Sep 12 01:04:37 localhost kernel: raid5: device sdb operational as raid disk 0
Sep 12 01:04:37 localhost kernel: raid5: device sde operational as raid disk 4
Sep 12 01:04:37 localhost kernel: raid5: device sdf operational as raid disk 3
Sep 12 01:04:37 localhost kernel: raid5: device sdd operational as raid disk 2
Sep 12 01:04:37 localhost kernel: raid5: device sdc operational as raid disk 1
Sep 12 01:04:37 localhost kernel: raid5: allocated 5248kB for md0
Sep 12 01:04:37 localhost kernel: raid5: raid level 5 set md0 active with 5 out of 5 devices, algorithm 2
Sep 12 01:04:37 localhost kernel: RAID5 conf printout:
Sep 12 01:04:37 localhost kernel: --- rd:5 wd:5 fd:0
Sep 12 01:04:37 localhost kernel: disk 0, o:1, dev:sdb
Sep 12 01:04:37 localhost kernel: disk 1, o:1, dev:sdc
Sep 12 01:04:37 localhost kernel: disk 2, o:1, dev:sdd
Sep 12 01:04:37 localhost kernel: disk 3, o:1, dev:sdf
Sep 12 01:04:37 localhost kernel: disk 4, o:1, dev:sde
Sep 12 01:04:37 localhost kernel: .<6>md: syncing RAID array md0
Sep 12 01:04:37 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Sep 12 01:04:37 localhost kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Sep 12 01:04:37 localhost kernel: md: using 128k window, over a total of 293057280 blocks.
Sep 12 01:04:37 localhost kernel: md: md1 stopped.
Sep 12 01:04:37 localhost last message repeated 4 times
So it seems to me that mdrun forced the array to start, and since it began a
"sync" rather than a "reconstruct", it must have treated sdc as not failed
and used all five drives to rebuild the parity information (as I understand
it, a sync rebuilds parity, while a reconstruct rebuilds a failed drive).
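For what it's worth, I believe the mdadm equivalent of what mdrun did is a
forced assemble, something like the sketch below (the --force flag and the
whole-disk device names are my guesses based on the log above, not something
I've confirmed against mdrun's internals):
# stop the half-assembled array left over from the failed autorun
mdadm --stop /dev/md0
# force assembly from all five members, overriding the stale
# event count that got sdc kicked out as non-fresh
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# watch the resync progress
cat /proc/mdstat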
During the resync, and even after it finished, I could not access the XFS
filesystem. Neither xfs_repair nor xfs_check could find a valid XFS
superblock. I let xfs_repair scan the entire device and it could not find a
single XFS superblock anywhere. However, piping /dev/md0 through strings
does yield some filenames that I recognize from the filesystem.
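For reference, the checks were along these lines (the exact flags are
approximations rather than a transcript; as far as I know none of these
write to the device):
# no-modify check; with a bad primary superblock xfs_repair will
# scan the whole device looking for secondary superblocks
xfs_repair -n /dev/md0
xfs_check /dev/md0
# this is what does turn up recognizable filenames
strings /dev/md0 | less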
So now I've got this array running, but I still don't know what malfunctioned
in the first place. On top of that, I have a broken filesystem that I don't
want to give up on, because I'd be losing a ton of data. Does anyone have any
suggestions?
-Adar
PS: I'm not subscribed to debian-user, so please include me in the replies.