Dieter wrote:
At work I've got a server with an LSI MegaRAID (dmesg below) that
suddenly seems to be killing hard drives. Last Thursday I had one drive
fail, and the system didn't begin rebuilding onto the hot spare until I
rebooted.
I would hope that the controller isn't killing drives.
Me, too. Or the enclosure.
Can we presume the system has clean power, temps are OK, no vibration, etc.?
Yes, all power is through an MGE Pulsar Evolution. The server is rack
mounted, and sysctl reports all temps as normal.
[EMAIL PROTECTED]:/home/jross $ sysctl -a | grep hw
hw.sensors.ami0.drive0=degraded (sd0), WARNING
hw.sensors.ami0.drive1=online (sd1), OK
hw.sensors.ami0.drive2=online (sd2), OK
hw.sensors.safte0.temp0=23.00 degC, OK
hw.sensors.safte1.temp0=24.00 degC, OK
hw.sensors.lm1.temp0=40.00 degC
hw.sensors.lm1.temp1=29.00 degC
hw.sensors.lm1.temp2=29.50 degC
hw.sensors.lm1.fan0=6026 RPM
hw.sensors.lm1.fan1=6026 RPM
Hitachi's drive-testing tool seems to be Windows-only, so are there any
drive-checking utilities that can test an individual drive while it's
part of a RAID1? Or is it safe to assume that if a drive fails out of
the RAID it really is dead? I'm trying to make sure I'm not seeing some
kind of problem with the enclosure or the megaraid card before I start
shipping drives back to Hitachi.
Can you get the SMART data from the drives? Interpreting SMART data
is another problem, but maybe you can find a clue there.
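Once a suspect drive is on a plain (non-RAID) controller, something like
the sketch below could pull the SMART data. This assumes the
smartmontools package is installed; the device name /dev/sd3c is only a
placeholder for wherever the suspect disk shows up.

```shell
#!/bin/sh
# Hedged sketch: dump SMART health and attributes for one drive.
# Assumes smartctl from the smartmontools package; /dev/sd3c is a
# placeholder device name, not necessarily yours.
smart_report() {
    disk=$1
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found; install the smartmontools package" >&2
        return 1
    fi
    smartctl -H "$disk" || return 1   # overall health verdict
    smartctl -A "$disk"               # attribute table: watch
                                      # Reallocated_Sector_Ct,
                                      # Current_Pending_Sector, and the
                                      # CRC/interface error counters
}
# Usage (placeholder device): smart_report /dev/sd3c
```

Rising reallocated or pending sector counts would point at the drive
itself rather than the controller or enclosure.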
Is it possible that the drives just took "too long" to read or write and
the RAID marked them bad? Maybe remapping a bad sector takes too long...
Maybe hook them to a different controller (no RAID) and do a simple test
with dd over the entire drive, something like
dd if=/dev/suspect_disk of=/dev/null bs=1m
dd if=/dev/zero of=/dev/suspect_disk bs=1m
and see if you get any errors from dd or in dmesg.
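The read pass above can be wrapped so dd's exit status is actually
checked; a small sketch, where /dev/rsd3c is a placeholder for the
suspect disk's raw device on the non-RAID controller. Note BSD dd spells
the block size bs=1m and GNU dd bs=1M, so a plain byte count is used
here to work with either.

```shell
#!/bin/sh
# Read-scan sketch for a suspect disk on a plain controller.
# Only the read pass is wrapped here; the dd-from-/dev/zero write pass
# above destroys everything on the disk, so run it deliberately.
scan_read() {
    # $1: raw device (or file) to read end to end.
    # 1048576 = 1 MiB blocks (bs=1m on BSD dd, bs=1M on GNU dd).
    if dd if="$1" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read pass clean: $1"
    else
        echo "read pass FAILED: $1 (check dmesg for sense errors)" >&2
        return 1
    fi
}
# Usage (placeholder device): scan_read /dev/rsd3c
```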
Last night, after all the users left, I rebooted the server to get into
the MegaRAID controller at boot. It couldn't see the brand-new drive I
had just put into the safte0 enclosure, so I couldn't make it a hot spare.
I installed the two drives that have now failed into another server with
an identical setup (one minor variation--it has two separate LSI
MegaRAID cards instead of one card with two channels) and a completely
empty SAF-TE enclosure, and again the card could not see the drives at
all. I'm thinking that means they really are dead.
I have another chassis and a new SuperMicro motherboard with onboard
SCSI that I'll build up today. Then I should be able to get at the
individual drives without going through the LSI RAID card and try the
tests you suggest.
The fact that the LSI card couldn't see that new drive (identical in
size, but 15K RPM instead of 10K) is disconcerting, to say the least.
The only comforting thought is that in this case sd0 holds the /, swap,
/usr, and similar partitions--all operating system, no database or web
server partitions. I think I'll double up on the tape backups, just to
be sure.
Thanks for the suggestions.
Jeff