David Mathog wrote:
Donald Becker wrote:
On Mon, 30 Mar 2009, David Mathog wrote:
Joe Landman wrote:
Vendors have an nVidia supplied *GEMM based burn in test. Been thinking
about a set of diagnostics end users can run as a sanity check.
My suspicion is that vendors run such burn in tests only for a very
brief time. That time being "the minimum time required to find the
percentage of failed units above which it would cost us more if they
were found to be bad in the field" - and not a second longer.
I don't know about other vendors, but that's not Penguin's approach.
By "vendor" I meant graphics card vendors, not cluster or HPC vendors.
My interest in this sort of diagnostic arose in relation to an
inexpensive graphics card bought at Newegg. I was asking here
specifically because it seemed likely that HPC vendors _would_ have
the sort of GPU diagnostic I was seeking, and might be willing to share
it. (As opposed to the tool Joe referred to, which seems not to be
generally available.)
FWIW, we agree with (and implement something similar to) Don's burn in
procedure, and yes, it sometimes annoys customers who want it *now*.
But it also (massively) reduces infant mortality rates (and we have
even designed new disk packaging to reduce the impact of the sometimes
fatal disk malady named UPS/Fedex-osis).
This said, there really isn't a memory checker for GPUs just yet. Could
be done, and probably should be ...
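
To make the memory checker idea concrete, here is a minimal host-side sketch of the classic write-pattern/read-back test. It runs against an ordinary Python bytearray, not GPU memory; a real GPU checker would have to do the fill and verify on the device itself. The names PATTERNS, fill, verify and pattern_test are made up for this illustration and do not come from any existing tool.

```python
# Assumption: a bytearray stands in for a block of device memory.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits


def fill(buf, pattern):
    """Write one byte pattern over the whole buffer."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf, pattern):
    """Return (offset, expected, observed) for every byte that does not match."""
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]


def pattern_test(buf, patterns=PATTERNS):
    """Run fill/verify once per pattern and collect all mismatches."""
    errors = []
    for p in patterns:
        fill(buf, p)
        errors.extend(verify(buf, p))
    return errors
```

Calling pattern_test(bytearray(1 << 20)) on healthy host memory should return an empty list. Keeping fill and verify separate makes it easy to inject a fault between the two and confirm the checker actually reports it.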
Also, likely we should have a long term crunching diagnostic, where we
already know the answer to a computational problem, and simply have it
burn cycles.
But GPUs are more complex than this: we need to worry about PCIe bus
transfers, several different flavors of memory, etc.
Really, since there is very little you can do if a GPU card is toast,
other than replace it, it might be better to have the test done at that
granularity, i.e. the whole card.
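
The "already know the answer" crunching diagnostic mentioned above could look roughly like the sketch below. It is a pure-Python, CPU-only illustration: the matrices are chosen so the product has a closed form, and the loop simply recomputes it and counts mismatches. On a GPU one would presumably swap matmul for a *GEMM call plus host/device copies, so that the PCIe transfers get exercised too; that part is not shown. The names expected, matmul and burn, and the max_iters parameter, are hypothetical.

```python
import time


def expected(n, i, j):
    """With A[i][k] = i + 1 and B[k][j] = j + 1, C[i][j] = n * (i + 1) * (j + 1)."""
    return n * (i + 1) * (j + 1)


def matmul(a, b):
    """Plain triple-loop style matrix product on lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def burn(n=32, seconds=1.0, max_iters=None):
    """Recompute the known product until time (or max_iters) runs out.

    Returns (iterations, mismatching_elements).
    """
    a = [[i + 1] * n for i in range(n)]                 # A[i][k] = i + 1
    b = [[j + 1 for j in range(n)] for _ in range(n)]   # B[k][j] = j + 1
    deadline = time.monotonic() + seconds
    iters = failures = 0
    while time.monotonic() < deadline and (max_iters is None or iters < max_iters):
        c = matmul(a, b)
        failures += sum(
            1 for i in range(n) for j in range(n) if c[i][j] != expected(n, i, j)
        )
        iters += 1
    return iters, failures
```

Something like burn(n=64, seconds=3600) would then run for an hour and report how many elements ever came out wrong. Integer inputs are used so the comparison is exact; a floating point version would need a tolerance.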
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org