/usr broken, will the machine reboot ?

Miles Fidelman Wed, 14 Sep 2011 05:53:43 -0700

Bryan Irvine wrote:

Which brings me to another fun question. What's your worst administration mistake and how did you recover? -Bryan

Discovered the hard way the symptoms of a failing drive in a RAID array, leading to completely rebuilding an O/S install and restoring from backup.

Had a server that was running slower... and slower... and slower.... Still running, but taking forever to respond to even the simplest prompts. Couldn't figure out what was wrong - some things made it look like hardware, some like software.

Long story short: it turns out one of the drives in a 4-drive RAID array was experiencing a high, and increasing, raw-read-error rate. Since the drive's internal software was doing re-reads, and eventually succeeding, the result was that the drive simply slowed down, and pulled down the response time of the entire array. That's when I discovered (after the fact) that the Linux md drivers don't consider long delays a reason for failing a drive out of an array.
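Since md won't flag a drive that is merely slow, one way to spot it is to compare per-drive read latency yourself. A minimal sketch, assuming the standard /proc/diskstats layout (reads completed in the 4th field, milliseconds spent reading in the 7th); the function names and the 5x-median threshold are illustrative choices of mine, not anything md provides:

```python
import statistics


def parse_diskstats(text):
    """Return {device: (reads_completed, ms_spent_reading)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        stats[fields[2]] = (int(fields[3]), int(fields[6]))
    return stats


def read_latency_ms(before, after):
    """Average milliseconds per completed read between two snapshots."""
    latency = {}
    for name, (reads1, ms1) in after.items():
        if name not in before:
            continue
        reads0, ms0 = before[name]
        reads = reads1 - reads0
        if reads > 0:
            latency[name] = (ms1 - ms0) / reads
    return latency


def slow_members(latency, members, factor=5.0):
    """Array members whose latency exceeds `factor` times the group median.

    `factor` is an arbitrary illustrative threshold, not a tuned value.
    """
    values = [latency[m] for m in members if m in latency]
    if len(values) < 2:
        return []
    median = statistics.median(values)
    if median <= 0:
        return []
    return [m for m in members if m in latency and latency[m] > factor * median]
```

To use it, read /proc/diskstats twice some seconds apart while the array is busy, pass both snapshots through parse_diskstats and read_latency_ms, and hand slow_members the names of the array's member devices.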

Worse... when you're running a high-availability configuration (xen, pacemaker, drbd, etc.), one slow drive in an array on one server drags down the DRBD mirror as well. The good news: when I powered down the failing system, the backup started to work just fine. The bad news: I trashed some stuff before figuring this out. Sigh...

If I had known, I could have pulled one drive, plugged in a new one, let the array rebuild, and kept on going. Unfortunately, what I did was... lots of diagnostics, lots of trial and error, ultimately trashing my system and some user data (not a lot... good backups)... and ultimately had to reinstall the o/s and restore from backup.
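For reference, the swap I should have done looks roughly like this with mdadm. This is a dry-run sketch that only builds and prints the commands; the device names in the usage note are hypothetical placeholders, and the helper names are mine:

```python
import shlex


def replacement_commands(array, failing, replacement):
    """Build the mdadm commands for swapping one array member.

    Nothing is executed here. The physical drive swap (and partitioning
    of the new drive) happens between the --remove and --add steps.
    """
    return [
        ["mdadm", array, "--fail", failing],     # mark the slow drive as faulty
        ["mdadm", array, "--remove", failing],   # detach it from the array
        ["mdadm", array, "--add", replacement],  # add the new drive; rebuild starts
    ]


def format_commands(commands):
    """Render each command as a shell-safe string for review before running."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
```

Calling format_commands(replacement_commands("/dev/md0", "/dev/sdb1", "/dev/sde1")) gives three lines to review and run by hand as root; rebuild progress then shows up in /proc/mdstat.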


Four lessons learned:

- RAID and high-availability configurations are vulnerable to a single drive failure
- keep a close eye on the raw-read-error rates of drives (anything over 0 raises questions)
- be sure to purchase server-grade drives (they assume that failures will be handled by a RAID array, so spend less time trying to recover from a read error)
- when one disk starts going, replace them all (assuming that they went online at the same time)... it's amazing how similar the lifetime is for all the disks in an array
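On watching the raw-read-error rate: a minimal sketch of checking it from the attribute table that `smartctl -A` prints, assuming the usual ten-column layout with the raw value last. The set of watched attributes is my own illustrative pick, and raw values are vendor-specific, so a nonzero number is a prompt to look closer rather than proof of a failing drive:

```python
import subprocess

# Attributes worth a look when nonzero (an illustrative selection).
WATCHED = ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct", "Current_Pending_Sector")


def read_smart(device):
    """Return the text of `smartctl -A <device>` (needs root and smartmontools)."""
    # smartctl's exit status is a bitmask, so it is not treated as an error here.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return result.stdout


def parse_attributes(text):
    """Return {attribute_name: raw_value} from the smartctl -A attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue  # raw value in a format this sketch does not handle
    return attrs


def nonzero_watched(text, watched=WATCHED):
    """Watched attributes whose raw value is above zero."""
    attrs = parse_attributes(text)
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}
```

Running nonzero_watched(read_smart(dev)) for each array member from cron, and mailing yourself whenever the result changes, would have pointed at the bad drive long before the array slowed to a crawl.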


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In<fnord>  practice, there is.   .... Yogi Berra




Re: Worst Admin Mistake? was --> Re: /usr broken, will the machine reboot ?
