On Mon, 30 Mar 2009, David Mathog wrote:

> Joe Landman wrote:
> > Vendors have an nVidia supplied *GEMM based burn in test.  Been thinking
> > about a set of diagnostics end users can run as a sanity check.
>
> My suspicion is that vendors run such burn in tests only for a very
> brief time.  That time being "the minimum time required to find the
> percentage of failed units above which it would cost us more if they
> were found to be bad in the field" - and not a second longer.

I don't know about other vendors, but that's not Penguin's approach.
One reason is that we don't know the failure profile.  But really it's
a trade-off between delivery expectations, likelihood of failures, and
even how much air conditioning capacity remains in the burn-in room.

We used to have a published policy of a minimum three day successful
burn-in.  If a part failed, or even if the machine rebooted, the three
day clock started again.  The challenge with that policy is that it
leads to unpredictable delivery, which is distressing to someone that
needs servers or workstations Right Now.

Today the policy is much more flexible, in part driven by Penguin's
change to building mostly clusters.  Burn-in time is based on the
product, potentially modified by per-machine notes on the customer
delivery requirements.  Cluster nodes have a preliminary stand-alone
burn-in before being racked into a cluster.  Whole clusters then have a
full burn-in, usually running benchmarks and demo applications.

You might expect nearly zero errors when already-tested machines are
grouped in a cluster, but cluster applications can reveal errors that
typical burn-in tests don't trigger.  And even a low percentage of
failures looks pretty bad when you have a few hundred machines in a
cluster.

> Finding
> marginal memory, certainly one of the easier tests, can easily take 24
> hours of testing.

And typically those memory modules test OK in a tester, even after
being pulled from a machine showing memory errors.  (That's not
surprising, since most distributors test modules just before shipping
them, and they are tested again just before installation.)

> Somehow I cannot imagine vendors spending quite that
> long burning in a graphics card.  Well, maybe a top of the line pro
> card, but certainly not your run of the mill $39 budget card.

I'm guessing every vendor shipping big clusters or CUDA GPU systems
does a substantial burn-in, although it's likely rare that they use
parallel applications and check for successful runs.

It's consumer-oriented low end production lines that can't fit a longer
burn-in into the process.  A production line with pre-imaged OS
installations pretty much cannot do a full burn-in.

-- 
Donald Becker                           bec...@scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com                www.scyld.com
Annapolis MD and San Francisco CA
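
To put a rough number on the earlier remark that even a low percentage
of failures looks bad across a few hundred machines, here is a minimal
sketch.  It assumes node failures are independent, and the 1% per-node
rate and 300-node cluster size are made-up illustrative figures, not
measured data.  The function names are hypothetical.

```python
def prob_any_failure(per_node_rate: float, nodes: int) -> float:
    """Chance that at least one of `nodes` machines fails.

    Assumes each machine fails independently with the same probability,
    which is a simplification (shared power or cooling faults would
    break it).
    """
    if not 0.0 <= per_node_rate <= 1.0:
        raise ValueError("per_node_rate must be between 0 and 1")
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    # P(no machine fails) = (1 - p)^N, so P(at least one) is the complement.
    return 1.0 - (1.0 - per_node_rate) ** nodes


def expected_failures(per_node_rate: float, nodes: int) -> float:
    """Average number of failing machines under the same assumption."""
    return per_node_rate * nodes


if __name__ == "__main__":
    rate, size = 0.01, 300  # illustrative values only
    print(f"P(at least one failure) = {prob_any_failure(rate, size):.3f}")
    print(f"Expected failing nodes  = {expected_failures(rate, size):.1f}")
```

With those assumed figures, a failure rate that sounds negligible for a
single workstation makes a trouble-free cluster delivery the exception
rather than the rule, which is the case for burning in the whole
cluster and not only the individual nodes.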