Bryan Irvine wrote:
> Which brings me to another fun question. What's your worst
> administration mistake and how did you recover?
> -Bryan
I discovered the hard way what a failing drive in a RAID array looks
like, and ended up completely rebuilding an OS install and restoring
from backup.
Had a server that was running slower... and slower... and slower....
Still running, but taking forever to respond to even the simplest
prompts. Couldn't figure out what was wrong - some things made it look
like hardware, some like software.
Long story short: one of the drives in a 4-drive RAID array was
experiencing a high, and increasing, raw-read-error rate. Since the
drive's internal firmware was doing re-reads, and eventually
succeeding, the drive never reported a failure; it simply slowed down,
and pulled down the response time of the entire array. That's when I
discovered (after the fact) that the Linux md driver doesn't consider
long delays a reason for failing a drive out of an array.
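
Since md won't flag a slow member on its own, one way to spot it is to
compare per-drive read latency across the array members. What follows is
a minimal sketch, not something from the original incident: it assumes
you capture the text of /proc/diskstats twice, some seconds apart, and
the function names and the 5x-median threshold are made up for
illustration.

```python
import statistics

def parse_diskstats(text):
    """Map device name -> (reads completed, ms spent reading)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        # Columns: major, minor, name, reads completed, reads merged,
        # sectors read, ms spent reading, ...
        stats[fields[2]] = (int(fields[3]), int(fields[6]))
    return stats

def avg_read_latency_ms(before, after):
    """Average ms per completed read for each device between two samples."""
    latency = {}
    for dev, (reads1, ms1) in after.items():
        if dev not in before:
            continue
        reads0, ms0 = before[dev]
        reads = reads1 - reads0
        if reads > 0:
            latency[dev] = (ms1 - ms0) / reads
    return latency

def slow_members(latency, members, factor=5.0):
    """Array members whose latency exceeds `factor` times the median.

    `factor` is an arbitrary illustrative threshold, not a tuned value.
    """
    present = [m for m in members if m in latency]
    if len(present) < 2:
        return []
    median = statistics.median(latency[m] for m in present)
    return [m for m in present if latency[m] > factor * median]
```

To use it, read /proc/diskstats, wait a while under normal load, read it
again, and pass the member names of the array (for example the devices
listed in /proc/mdstat) as `members`. A drive that keeps retrying reads
should stand out from its siblings.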
Worse: when you're running a high-availability configuration (Xen,
Pacemaker, DRBD, etc.), one slow drive in an array on one server drags
down the DRBD mirror as well. The good news: when I powered down the
failing system, the backup server started to work just fine. The bad
news: I trashed some stuff before figuring this out. Sigh...
If I had known, I could have pulled the one drive, plugged in a new one,
let the array rebuild, and kept on going. Unfortunately, what I did
instead was lots of diagnostics and lots of trial and error, trashing my
system and some user data along the way (not a lot; good backups), and
in the end I had to reinstall the OS and restore from backup.
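
For reference, the "pull one drive, plug in a new one, let the array
rebuild" path can be sketched with mdadm's manage mode. This is a
sketch under assumptions, not a record of what was run: the device
names are hypothetical placeholders, and the helper only builds the
command lines so they can be reviewed before anything touches the
array.

```python
import shlex

def replacement_commands(array, failing, replacement):
    """Build the mdadm commands to swap one member of an md array.

    All three arguments are device paths supplied by the caller; the
    values used in the example below are hypothetical.
    """
    return [
        # Mark the slow drive as faulty so md stops using it.
        ["mdadm", "--manage", array, "--fail", failing],
        # Detach it from the array so it can be physically pulled.
        ["mdadm", "--manage", array, "--remove", failing],
        # Add the new drive; md starts rebuilding onto it.
        ["mdadm", "--manage", array, "--add", replacement],
    ]

def as_shell(commands):
    """Render the commands as shell lines for review."""
    return [shlex.join(cmd) for cmd in commands]
```

For example, `as_shell(replacement_commands("/dev/md0", "/dev/sdb1",
"/dev/sde1"))` gives three lines to run as root, after which the rebuild
progress shows up in /proc/mdstat.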
Four lessons learned:
- RAID and high-availability configurations are still vulnerable to a
single failing drive (one that slows down without dropping out of the
array)
- keep a close eye on the raw-read-error rates of drives (anything over
0 raises questions)
- be sure to purchase server-grade drives (they assume that failures
will be handled by a RAID array, so they spend less time trying to
recover from a read error)
- when one disk starts going, replace them all (assuming that they went
online at the same time)... it's amazing how similar the lifetimes are
for all the disks in an array
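
The second lesson is easy to automate. Below is a minimal sketch that
scans the attribute table printed by `smartctl -A` for the
Raw_Read_Error_Rate row. It assumes the usual ten-column layout with
the raw value in the last column; the function names are made up for
illustration. One caveat: some drives report large raw values for this
attribute in normal operation, so treat a non-zero value as a reason to
look closer and watch the trend, not as proof of failure.

```python
def smart_raw_values(smartctl_output):
    """Map SMART attribute name -> raw value from `smartctl -A` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten
        # columns; the tenth column is RAW_VALUE. Header and blank
        # lines fail these checks and are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values

def has_raw_read_errors(smartctl_output):
    """True if Raw_Read_Error_Rate is reported with a raw value over 0."""
    raw = smart_raw_values(smartctl_output).get("Raw_Read_Error_Rate")
    return raw is not None and raw > 0
```

Feeding it the saved output of `smartctl -A` for each array member from
a daily cron job, and keeping the numbers, is enough to see whether one
drive's count is climbing while the others stay flat.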
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra