I thought I would share an unpleasant experience: surviving a double disk failure with RAID 5.

We have a >1000-core cluster and a 130TB Lustre setup, which obviously means a lot of spindles. Our Lustre is a "cheap" scalable setup: 30 OSSes with software RAID 5.
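
For context, each OSS serves its OSTs off md RAID 5 arrays. A rough sketch of what one such array looks like (device names, disk count and chunk size here are purely illustrative, not our actual layout):

    # a 6-disk software RAID 5 array destined to back a Lustre OST
    mdadm --create /dev/md0 --level=5 --raid-devices=6 --chunk=128 /dev/sd[b-g]1
    cat /proc/mdstat    # array state and resync progress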

So what happens when you get a double disk failure in an OSS? Well, the md device drops, Lustre on the OSS obviously can't write to disk, and clients start getting errors. In Australia we say "it's gone balls up".

How do you recover without losing (much) data? Well, first let me say I generally hate hardware RAID solutions. Have a double disk failure and you're stuffed. Rebuild the LUN, rebuild your fs and say sorry to your clients/customers. It isn't fun, trust me, been there, done that. Not fun at all.

I love software RAID. You get a real CPU and lots of memory behind your RAID, and a real OS, which is what lets you recover from a double disk failure.

Ok. So what did we do? First we noted which sectors gave the errors and shut down. Then we removed one of the failed disks and put in a new one. Boot up and your md device won't start: you have one failed disk and one new disk.
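
For the record, the sector numbers came straight out of the kernel log, and after the reboot you can see the sorry state of the array with mdadm. Device names here are examples, not our real ones:

    # note the sector numbers from the read errors before shutting down
    dmesg | egrep -i 'medium error|i/o error'
    # after the reboot: one failed member, one blank replacement
    cat /proc/mdstat
    mdadm --examine /dev/sd[b-g]1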

Ok. This is where Linux is great.

Find the bad sectors and check that they really are faulty with dd, by reading a few MB around the failed sectors. Make sure you know the smallest block of corruption. Now dd over the top of the corruption, causing the disk to reallocate those sectors.
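
In concrete terms it was something like this. The sector numbers and device names below are made up (the dodgy disk is /dev/sdb here, 512-byte sectors assumed), and you really want to triple-check the device before the write step, because it destroys whatever is in that range:

    # read a few MB around the reported bad sector to confirm it really is bad
    dd if=/dev/sdb of=/dev/null bs=512 skip=123456700 count=4096
    # once narrowed to the smallest range, write zeros over it so the drive
    # reallocates those sectors (the data in that range is gone)
    dd if=/dev/zero of=/dev/sdb bs=512 seek=123456780 count=16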

Ok, now force-assemble the RAID device and rebuild/resync onto the new disk. Sure, you will not have a coherent fs any more, but you will have most of your data.
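
With mdadm that's along these lines (array and device names are illustrative):

    # force assembly with the dodgy-but-mostly-readable disk included
    mdadm --assemble --force /dev/md0 /dev/sd[b-e]1
    # add the replacement disk and let the resync run
    mdadm --manage /dev/md0 --add /dev/sdf1
    cat /proc/mdstat    # watch the rebuild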

Once the RAID rebuilds, your new disk should be a byte-for-byte replica of the first disk you removed, except for the area you dd'd over. Now shut down, remove the failed disk, put the first failed disk back in and reboot. Stop the RAID device (which should have started in degraded mode) and dd the appropriate sectors from the failed disk to the new disk, replacing the corrupt-rebuilt area (hopefully this works, and you didn't suffer the double failure at exactly the same place on both disks). Then shut down again and put another new disk in. Reboot and rebuild/resync your RAID device onto the second new disk.
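
The delicate bit is that last copy. A sketch, with made-up device names and numbers, and assuming the member partitions are laid out identically so the same offsets line up on both disks:

    # stop the degraded array so nothing is touching the disks
    mdadm --stop /dev/md0
    # copy the range we zeroed earlier from the original failed disk (back in
    # the box as /dev/sdb) over the corrupt-rebuilt area on the new disk (/dev/sdc)
    dd if=/dev/sdb of=/dev/sdc bs=512 skip=123456780 seek=123456780 count=16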

Now, of course, an fsck and a Lustre fsck, and you are back in action. Possibly with a MB or two of bad data.
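
Assuming ldiskfs-backed OSTs (the usual case), that boils down to an e2fsck on the affected OST device:

    # repair the backing filesystem on the affected OST (array is /dev/md0 here)
    e2fsck -fy /dev/md0

The Lustre-level check (lfsck) comes after that; the exact procedure differs between Lustre releases, so follow the manual for your version rather than anything I'd type from memory.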

File system recovered. Restart Lustre and all your clients go through recovery and hopefully overwrite any bad data in your file system. Jobs which failed get rerun and, again, hopefully overwrite the small area that may have been corrupted. Hopefully all is ok.

What are the lessons learnt? Well, with software RAID, Linux is both your friend and your enemy. The behaviour of md got us into this mess. When md gets an error on read, it recovers the data from the other disks and re-writes the blocks to the failing disk, hoping the disk will reallocate. You do get a warning saying that md encountered a recoverable error, so you think it is ok. BUT the disk still failed on read and you haven't swapped it out. Some time later, when another disk fails hard and you get a failed read on your other dodgy disk, md sees two failed disks. And it's all over.

My advice: don't let Linux collude with the disk vendors to reduce your reliability. Swap out any disk that gets a correctable error on read. Reallocation on write is fine; on read it is not. The disk has failed.
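
A few checks worth keeping an eye on so you catch this early; take the exact log wording and SMART attribute names with a grain of salt, they vary by kernel and drive:

    # on our kernels md notes corrected read errors in the kernel log
    dmesg | grep -i 'read error corrected'
    # the drive's own SMART counters tell the same story
    smartctl -A /dev/sdb | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'
    # and mdadm can mail you as soon as an array degrades
    mdadm --monitor --scan --mail=root --daemonise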

Tim, you're a genius. Thanks mate. Once I land back in the country, cold beers all round.


--
Stu Midgley
sdm...@gmail.com
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
