I thought I would share an unpleasant experience: surviving a double disk failure with RAID 5.

We have a >1000-core cluster and a 130TB Lustre setup, which obviously means a lot of spindles. Our Lustre is a "cheap" scalable setup: 30 OSSes with software RAID 5.
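
For context, each OSS serves its OSTs off md RAID 5 arrays. A rough sketch of what one such array looks like (device names, disk count and chunk size here are purely illustrative, not our actual layout):

    # a 6-disk software RAID 5 array destined to back a Lustre OST
    mdadm --create /dev/md0 --level=5 --raid-devices=6 --chunk=128 /dev/sd[b-g]1
    cat /proc/mdstat    # array state and resync progress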

So what happens when you get a double disk failure in an OSS? Well, the md device drops, Lustre on the OSS obviously can't write to disk, and clients start getting errors. In Australia we say "it's gone balls up".

How do you recover without losing (much) data? Well, first let me say I generally hate hardware RAID solutions. Have a double disk failure and you're stuffed. Rebuild the LUN, rebuild your fs and say sorry to your clients/customers. It isn't fun, trust me, been there, done that. Not fun at all.

I love software RAID. You get a real CPU and lots of memory behind your RAID, and a real OS, which is what lets you recover from a double disk failure.

Ok. So what did we do? First we noted which sectors gave the errors and shut down. Then we removed one of the failed disks and put in a new one. Boot up and your md device won't start: you have one failed disk and one new disk.
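
For the record, the sector numbers came straight out of the kernel log, and after the reboot you can see the sorry state of the array with mdadm. Device names here are examples, not our real ones:

    # note the sector numbers from the read errors before shutting down
    dmesg | egrep -i 'medium error|i/o error'
    # after the reboot: one failed member, one blank replacement
    cat /proc/mdstat
    mdadm --examine /dev/sd[b-g]1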

Ok. This is where Linux is great.

Find the bad sectors and check that they really are faulty with dd, by reading a few MB around the failed sectors. Make sure you know the smallest block of corruption. Now dd over the top of the corruption, causing the disk to reallocate those sectors.
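
In concrete terms it was something like this. The sector numbers and device names below are made up (the dodgy disk is /dev/sdb here, 512-byte sectors assumed), and you really want to triple-check the device before the write step, because it destroys whatever is in that range:

    # read a few MB around the reported bad sector to confirm it really is bad
    dd if=/dev/sdb of=/dev/null bs=512 skip=123456700 count=4096
    # once narrowed to the smallest range, write zeros over it so the drive
    # reallocates those sectors (the data in that range is gone)
    dd if=/dev/zero of=/dev/sdb bs=512 seek=123456780 count=16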

Ok, now force-assemble the RAID device and rebuild/resync onto the new disk. Sure, you will not have a coherent fs any more, but you will have most of your data.
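
With mdadm that's along these lines (array and device names are illustrative):

    # force assembly with the dodgy-but-mostly-readable disk included
    mdadm --assemble --force /dev/md0 /dev/sd[b-e]1
    # add the replacement disk and let the resync run
    mdadm --manage /dev/md0 --add /dev/sdf1
    cat /proc/mdstat    # watch the rebuild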

Once the RAID rebuilds, your new disk should be a byte-for-byte replica of the first disk you removed, except for the area you dd'd over. Now shut down, remove the failed disk, put the first failed disk back in and reboot. Stop the RAID device (which should have started in degraded mode) and dd the appropriate sectors from the failed disk to the new disk, replacing the corrupt-rebuilt area (hopefully this works, and you didn't suffer the double failure at exactly the same place on both disks). Then shut down again and put another new disk in. Reboot and rebuild/resync your RAID device onto the second new disk.
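
The delicate bit is that last copy. A sketch, with made-up device names and numbers, and assuming the member partitions are laid out identically so the same offsets line up on both disks:

    # stop the degraded array so nothing is touching the disks
    mdadm --stop /dev/md0
    # copy the range we zeroed earlier from the original failed disk (back in
    # the box as /dev/sdb) over the corrupt-rebuilt area on the new disk (/dev/sdc)
    dd if=/dev/sdb of=/dev/sdc bs=512 skip=123456780 seek=123456780 count=16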

Now, of course, an fsck and a Lustre fsck, and you are back in action. Possibly with a MB or two of bad data.
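
Assuming ldiskfs-backed OSTs (the usual case), that boils down to an e2fsck on the affected OST device:

    # repair the backing filesystem on the affected OST (array is /dev/md0 here)
    e2fsck -fy /dev/md0

The Lustre-level check (lfsck) comes after that; the exact procedure differs between Lustre releases, so follow the manual for your version rather than anything I'd type from memory.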

File system recovered. Restart Lustre and all your clients go through recovery and hopefully overwrite any bad data in your file system. Jobs which failed get rerun and, again, hopefully overwrite the small area that may have been corrupted. Hopefully all is ok.

What are the lessons learnt? Well, with software RAID, Linux is both your friend and your enemy. The behaviour of md got us into this mess. When md gets an error on read, it recovers the data from the other disks and re-writes the blocks to the failing disk, hoping the disk will reallocate. You do get a warning saying that md encountered a recoverable error, so you think it is ok. BUT the disk still failed on read and you haven't swapped it out. Some time later, when another disk fails hard and you get a failed read on your other dodgy disk, md sees two failed disks. And it's all over.

My advice: don't let Linux collude with the disk vendors to reduce your reliability. Swap out any disk that gets a correctable error on read. Reallocation on write is fine; on read it is not. The disk has failed.
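
A few checks worth keeping an eye on so you catch this early; take the exact log wording and SMART attribute names with a grain of salt, they vary by kernel and drive:

    # on our kernels md notes corrected read errors in the kernel log
    dmesg | grep -i 'read error corrected'
    # the drive's own SMART counters tell the same story
    smartctl -A /dev/sdb | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'
    # and mdadm can mail you as soon as an array degrades
    mdadm --monitor --scan --mail=root --daemonise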

Tim, you're a genius. Thanks mate. Once I land back in the country, cold beers all round.


--
Stu Midgley
sdm...@gmail.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
