For checking GPUs, we use the following to determine whether any ECC errors exist. Things seem a lot better now, but in the past an ECC error was 99% of the time a hardware issue with the GPU or with how it was plugged in:
nvidia-smi -a --xml-format | grep -A 33 "<ecc_errors>" | grep "<total>" | grep -v "<total>0</total>"

Obviously it could be done more nicely and there are other bits of info you can get at (e.g. driver versions etc).

Paul

-----Original Message-----
From: Beowulf [mailto:beowulf-boun...@beowulf.org] On Behalf Of Peter Kjellström
Sent: Wednesday, 23 March 2016 4:08 AM
To: Olli-Pekka Lehto <olli-pekka.le...@csc.fi>
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] Cluster consistency checks

On Tue, 22 Mar 2016 17:32:40 +0200 (EET) Olli-Pekka Lehto <olli-pekka.le...@csc.fi> wrote:

> Hi,
>
> I finally got around to writing down my cluster-consistency checklist
> that I've been planning for a long time:
>
> https://github.com/oplehto/cluster-checks/

Looks quite close to what we do. A few additions (randomly floating to the top):

* use dshbak / pshbak / dbuck to overview pdsh output (latter two from
  https://www.nsc.liu.se/~kent/python-hostlist/)
* use conrep to read out bios settings from hp servers
* dmidecode -t memory can show dimm details

We also do most of this automatically in production with our node-health-check suite (will catch bios settings, firmware, cpu and memory performance, ...).

/Peter K

> The goal is to try to make the baseline installation of a cluster as
> consistent as possible and make vendors work for their money. :) Of
> course hopefully publishing this will help vendors capture some of the
> issues that slip through the cracks even before clusters are handed
> over. It's also a good idea to run these types of checks during the
> lifetime of the system as there's always some consistency creep as
> hardware gets replaced.
>
> If someone is interested in contributing, pull requests or comments on
> the list are welcome. I'm sure that there's something missing as well.
> Right now it's just a text-file but making some nicer scripts and
> postprocessing for the output might happen as well at some point.
> All the examples are very HP oriented as well at this point.
>
> Best regards,
> Olli-Pekka

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
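
As a rough sketch of the "done more nicely" version of the nvidia-smi ECC check at the top of this message: instead of grepping a fixed 33 lines after <ecc_errors>, one could parse the XML and report every non-zero <total> per GPU. The function names (nonzero_ecc_totals, report) are made up for illustration. It also assumes each GPU appears as a <gpu id="..."> element with <ecc_errors> somewhere beneath it, and that layout may differ between driver versions.

```python
import subprocess
import xml.etree.ElementTree as ET


def nonzero_ecc_totals(xml_text):
    """Return (gpu_id, path, count) for each non-zero <total> under <ecc_errors>.

    Assumes the XML has <gpu id="..."> elements containing <ecc_errors>;
    the element names below <ecc_errors> are not hard-coded.
    """
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path, gpu_id):
        # Depth-first walk that remembers the tag path to each <total>.
        for child in elem:
            child_path = path + [child.tag]
            text = (child.text or "").strip()
            # Non-numeric values (e.g. N/A) are skipped, not flagged.
            if child.tag == "total" and text.isdigit() and int(text) > 0:
                hits.append((gpu_id, "/".join(child_path), int(text)))
            walk(child, child_path, gpu_id)

    for gpu in root.iter("gpu"):
        for ecc in gpu.iter("ecc_errors"):
            walk(ecc, [], gpu.get("id", "unknown"))
    return hits


def report():
    """Run nvidia-smi with the same flags as the one-liner and print any hits."""
    xml_text = subprocess.run(
        ["nvidia-smi", "-a", "--xml-format"],
        capture_output=True, text=True, check=True,
    ).stdout
    for gpu_id, path, count in nonzero_ecc_totals(xml_text):
        print(f"{gpu_id}: {path} = {count}")
```

Calling report() on a GPU node prints one line per non-zero counter and nothing when all totals are zero, so it can be wrapped with pdsh in the same way as the grep pipeline. One difference from the grep version is that a non-numeric total such as N/A is silently ignored here, whereas grep -v "<total>0</total>" would let it through.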