Dieter wrote:
At work I've got a server with an LSI MegaRAID (dmesg below) that
suddenly seems to be killing hard drives. Last Thursday I had one drive
fail, and the system didn't begin rebuilding onto the hot spare until I
rebooted.
I would hope that the controller isn't killing drives.
Me, too. Or the enclosure.
Can we presume the system has clean power, temps are OK, no vibration, etc.?
Yes, all power is through an MGE Pulsar Evolution. The server is rack
mounted, and sysctl reports all temps as normal.
[EMAIL PROTECTED]:/home/jross $ sysctl -a | grep hw
hw.sensors.ami0.drive0=degraded (sd0), WARNING
hw.sensors.ami0.drive1=online (sd1), OK
hw.sensors.ami0.drive2=online (sd2), OK
hw.sensors.safte0.temp0=23.00 degC, OK
hw.sensors.safte1.temp0=24.00 degC, OK
hw.sensors.lm1.temp0=40.00 degC
hw.sensors.lm1.temp1=29.00 degC
hw.sensors.lm1.temp2=29.50 degC
hw.sensors.lm1.fan0=6026 RPM
hw.sensors.lm1.fan1=6026 RPM
Hitachi's drive-testing tool seems to be Windows-only, so are there any
drive-checking utilities that can test an individual drive while it's
part of a RAID1? Or is it safe to assume that if a drive fails out of
the RAID it really is dead? I'm trying to make sure I'm not seeing some
kind of problem with the enclosure or the megaraid card before I start
shipping drives back to Hitachi.
Can you get the SMART data from the drives? Interpreting SMART data
is another problem, but maybe you can find a clue there.
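Once a suspect drive is on a plain (non-RAID) controller, something like
the sketch below could pull the SMART data. This assumes the
smartmontools package is installed; the device name /dev/sd3c is only a
placeholder for wherever the suspect disk shows up.

```shell
#!/bin/sh
# Hedged sketch: dump SMART health and attributes for one drive.
# Assumes smartctl from the smartmontools package; /dev/sd3c is a
# placeholder device name, not necessarily yours.
smart_report() {
    disk=$1
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found; install the smartmontools package" >&2
        return 1
    fi
    smartctl -H "$disk" || return 1   # overall health verdict
    smartctl -A "$disk"               # attribute table: watch
                                      # Reallocated_Sector_Ct,
                                      # Current_Pending_Sector, and the
                                      # CRC/interface error counters
}
# Usage (placeholder device): smart_report /dev/sd3c
```

Rising reallocated or pending sector counts would point at the drive
itself rather than the controller or enclosure.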
Is it possible that the drives just took "too long" to read or write and
the RAID marked them bad? Maybe remapping a bad sector takes too long...
Maybe hook them to a different controller (no RAID) and do a simple test
with dd over the entire drive, something like
dd if=/dev/suspect_disk of=/dev/null bs=1m
dd if=/dev/zero of=/dev/suspect_disk bs=1m
and see if you get any errors from dd or in dmesg.
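The read pass above can be wrapped so dd's exit status is actually
checked; a small sketch, where /dev/rsd3c is a placeholder for the
suspect disk's raw device on the non-RAID controller. Note BSD dd spells
the block size bs=1m and GNU dd bs=1M, so a plain byte count is used
here to work with either.

```shell
#!/bin/sh
# Read-scan sketch for a suspect disk on a plain controller.
# Only the read pass is wrapped here; the dd-from-/dev/zero write pass
# above destroys everything on the disk, so run it deliberately.
scan_read() {
    # $1: raw device (or file) to read end to end.
    # 1048576 = 1 MiB blocks (bs=1m on BSD dd, bs=1M on GNU dd).
    if dd if="$1" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read pass clean: $1"
    else
        echo "read pass FAILED: $1 (check dmesg for sense errors)" >&2
        return 1
    fi
}
# Usage (placeholder device): scan_read /dev/rsd3c
```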
Last night, after all the users left, I rebooted the server to get into
the MegaRAID controller at boot. It couldn't see the brand-new drive I
had just put into the safte0 enclosure, so I couldn't make it a hot spare.
I installed the two drives that have now failed into another server with
an identical setup (one minor variation--it has two separate LSI
MegaRAID cards instead of one card with two channels) and a completely
empty SAF-TE enclosure, and again the card could not see the drives at
all. I'm thinking that means they really are dead.
I have another chassis and a new SuperMicro motherboard with onboard
SCSI that I'll build up today. Then I should be able to get at the
individual drives without going through the LSI RAID card and try the
tests you suggest.
The fact that the LSI card couldn't see that new drive (identical in
size, but 15K RPM instead of 10K) is disconcerting, to say the least.
The only comforting thought is that in this case sd0 holds the /, swap,
/usr, and similar partitions--all operating system, no database or web
server partitions. I think I'll double up on the tape backups, just to
be sure.
Thanks for the suggestions.
Jeff