Despite my Duke e-mail address, I've been at Google since July. While
I'm not a co-author, I'm part of the group that did this study and can
answer (some) questions people may have about the paper.
> Dangling meat in front of the bears, eh? Well...
I can always hide behind my duck-blind-slash-moat-o'-NDA. :)
> Is there any info for failure rates versus type of main bearing
> in the drive?
> Failure rate versus any other implementation technology?
We haven't done this analysis, but you might be interested in this paper
from CMU:
http://www.usenix.org/events/fast07/tech/schroeder.html
They performed a similar study on drive reliability -- with the help of
some people/groups here, I believe -- and found no significant
differences in reliability between different disk technologies (SATA,
SCSI, IDE, FC, etc.).
> Failure rate vs. drive speed (RPM)?
Again, we may have the data but it hasn't been processed.
> Or to put it another way, is there anything to indicate which
> component designs most often result in the eventual SMART
> events (reallocation, scan errors) and then, ultimately, drive
> failure?
One of the problems noted in the paper is that even if you assume that
*any* SMART event is indicative in some way of an upcoming failure --
and are willing to deal with a metric boatload of false positives --
over one-third of failed drives had zero counts on all SMART parameters.
And one of these parameters -- seek errors -- was observed on nearly
three-quarters of the drives in our fleet, so you really would be
dealing with boatloads of false positives.
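
In case it helps to make the bookkeeping concrete, here is a rough
sketch in Python of what the naive "flag any nonzero SMART count"
predictor looks like and where the misses and false positives show up.
The field names and record layout are made up for illustration; this is
not our actual schema or the paper's methodology.

# Hypothetical SMART counters; names are invented for illustration.
SMART_FIELDS = ["reallocations", "scan_errors", "seek_errors", "offline_reallocations"]

def predicts_failure(drive):
    """Flag a drive if any monitored SMART counter is nonzero."""
    return any(drive.get(field, 0) > 0 for field in SMART_FIELDS)

def evaluate(fleet):
    """Tally hits, misses, and false positives for the naive predictor."""
    hits = misses = false_positives = 0
    for drive in fleet:
        flagged = predicts_failure(drive)
        if drive["failed"]:
            if flagged:
                hits += 1
            else:
                misses += 1       # failed drives with all-zero SMART counts
        elif flagged:
            false_positives += 1  # healthy drives showing, e.g., seek errors
    return hits, misses, false_positives

With over a third of failed drives landing in the "misses" bucket and
seek errors alone showing up on most healthy drives, both the miss rate
and the false-positive count stay uncomfortably high no matter how you
tune a rule like this.
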
> Failure rates versus rack position? I'd guess no effect here,
> since that would mostly affect temperature, and there was
> little temperature effect.
I imagine it wouldn't matter. Even if it did, I'm not sure we have this
data in an easy-to-parse-and-include format.
> Failure rates by data center? (Are some of your data centers
> harder on drives than others? If so, why?)
The CMU study is broken down by data center. Some data centers in their
study do appear to be harder on drives than others, but age and vintage
effects may be coming into play (an issue they acknowledge in the
paper). My intuition -- again, not having analyzed the data -- is that
application characteristics, and not data center characteristics, are
going to have the more pronounced effect. There is a section in our
paper on how utilization affects AFR over time.
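
For what it's worth, the AFR arithmetic in that kind of breakdown is
just failures divided by drive-years within each bucket. Here's a
minimal sketch, again with an invented record layout (utilization
bucket, drive-years contributed, failed flag), not our actual data
format:

from collections import defaultdict

def afr_by_utilization(records):
    """Annualized failure rate per utilization bucket.

    Each record is assumed to carry 'util' (e.g. 'low'/'medium'/'high'),
    'drive_years' (time observed), and 'failed' (bool).
    """
    drive_years = defaultdict(float)
    failures = defaultdict(int)
    for r in records:
        drive_years[r["util"]] += r["drive_years"]
        if r["failed"]:
            failures[r["util"]] += 1
    return {u: failures[u] / drive_years[u]
            for u in drive_years if drive_years[u] > 0}

The same tally works for any other grouping (drive model, vintage, data
center), which is essentially what both studies do.
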
> Are there air pressure and humidity measurements from your data
> centers? Really low air pressure (as at observatory height) is a known
> killer of disks; it would be interesting if lesser changes in air
> pressure also had a measurable effect. Low humidity cranks up static
> problems, and high humidity can result in condensation.
Once we start getting data from our Tibetan Monastery/West Asia data
center I'll let you know. :)
-jdm
Department of Computer Science, Duke University, Durham, NC 27708-0129
Email: [EMAIL PROTECTED]
Web: http://www.cs.duke.edu/~justin/
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf