Despite my Duke e-mail address, I've been at Google since July. While
I'm not a co-author, I'm part of the group that did this study and can
answer (some) questions people may have about the paper.
> Dangling meat in front of the bears, eh? Well...
I can always hide behind my duck-blind-slash-moat-o'-NDA. :)
> Is there any info for failure rates versus type of main bearing
> in the drive?
> Failure rate versus any other implementation technology?
We haven't done this analysis, but you might be interested in this paper
from CMU:
http://www.usenix.org/events/fast07/tech/schroeder.html
They performed a similar study on drive reliability -- with the help of
some people/groups here, I believe -- and found no significant
differences in reliability between different disk technologies (SATA,
SCSI, IDE, FC, etc.).
> Failure rate vs. drive speed (RPM)?
Again, we may have the data but it hasn't been processed.
> Or to put it another way, is there anything to indicate which
> component designs most often result in the eventual SMART
> events (reallocation, scan errors) and then, ultimately, drive
> failure?
One of the problems noted in the paper is that even if you assume that
*any* SMART event is indicative in some way of an upcoming failure --
and are willing to deal with a metric boatload of false positives --
over one-third of failed drives had zero counts on all SMART parameters.
And one of these parameters -- seek errors -- was observed on nearly
three-quarters of the drives in our fleet, so you really would be
dealing with boatloads of false positives.
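
In case it helps to make the bookkeeping concrete, here is a rough
sketch in Python of what the naive "flag any nonzero SMART count"
predictor looks like and where the misses and false positives show up.
The field names and record layout are made up for illustration; this is
not our actual schema or the paper's methodology.

# Hypothetical SMART counters; names are invented for illustration.
SMART_FIELDS = ["reallocations", "scan_errors", "seek_errors", "offline_reallocations"]

def predicts_failure(drive):
    """Flag a drive if any monitored SMART counter is nonzero."""
    return any(drive.get(field, 0) > 0 for field in SMART_FIELDS)

def evaluate(fleet):
    """Tally hits, misses, and false positives for the naive predictor."""
    hits = misses = false_positives = 0
    for drive in fleet:
        flagged = predicts_failure(drive)
        if drive["failed"]:
            if flagged:
                hits += 1
            else:
                misses += 1       # failed drives with all-zero SMART counts
        elif flagged:
            false_positives += 1  # healthy drives showing, e.g., seek errors
    return hits, misses, false_positives

With over a third of failed drives landing in the "misses" bucket and
seek errors alone showing up on most healthy drives, both the miss rate
and the false-positive count stay uncomfortably high no matter how you
tune a rule like this.
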
> Failure rates versus rack position? I'd guess no effect here,
> since that would mostly affect temperature, and there was
> little temperature effect.
I imagine it wouldn't matter. Even if it did, I'm not sure we have this
data in an easy-to-parse-and-include format.
> Failure rates by data center? (Are some of your data centers
> harder on drives than others? If so, why?)
The CMU study is broken down by data center. Some data centers in their
study do appear to be harder on drives than others, but age and vintage
effects may be coming into play (an issue they acknowledge in the
paper). My intuition -- again, not having analyzed the data -- is that
application characteristics, and not data center characteristics, are
going to have the more pronounced effect. There is a section in our
paper on how utilization affects AFR over time.
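
For what it's worth, the AFR arithmetic in that kind of breakdown is
just failures divided by drive-years within each bucket. Here's a
minimal sketch, again with an invented record layout (utilization
bucket, drive-years contributed, failed flag), not our actual data
format:

from collections import defaultdict

def afr_by_utilization(records):
    """Annualized failure rate per utilization bucket.

    Each record is assumed to carry 'util' (e.g. 'low'/'medium'/'high'),
    'drive_years' (time observed), and 'failed' (bool).
    """
    drive_years = defaultdict(float)
    failures = defaultdict(int)
    for r in records:
        drive_years[r["util"]] += r["drive_years"]
        if r["failed"]:
            failures[r["util"]] += 1
    return {u: failures[u] / drive_years[u]
            for u in drive_years if drive_years[u] > 0}

The same tally works for any other grouping (drive model, vintage, data
center), which is essentially what both studies do.
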
> Are there air pressure and humidity measurements from your data
> centers? Really low air pressure (as at observatory height) is a known
> killer of disks; it would be interesting if lesser changes in air
> pressure also had a measurable effect. Low humidity cranks up static
> problems, and high humidity can result in condensation.
Once we start getting data from our Tibetan Monastery/West Asia data
center I'll let you know. :)
-jdm
Department of Computer Science, Duke University, Durham, NC 27708-0129
Email: [EMAIL PROTECTED]
Web: http://www.cs.duke.edu/~justin/
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf