We use nvidia-smi as well. You should also keep an eye out for GPU ECC errors, as we have found these are good predictors of bad things happening due to heat. Generally you should see none.
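
A minimal sketch of that kind of check, assuming Python 3.7+ on the node and a driver whose nvidia-smi supports the --query-gpu interface (field names as listed by `nvidia-smi --help-query-gpu`). The function names and the 80 C threshold are illustrative assumptions only, not anything from our setup; pick a limit that suits your card and chassis.

```python
import csv
import io
import subprocess

# Query fields, named as in `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "temperature.gpu",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]


def query_gpus():
    """Run nvidia-smi once and return its CSV output as text."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def to_int(value):
    """Convert a field to int; return None for values such as '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_rows(text):
    """Turn the CSV text into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        idx, temp, corrected, uncorrected = (to_int(v.strip()) for v in rec)
        rows.append({
            "index": idx,
            "temp_c": temp,
            "ecc_corrected": corrected,
            "ecc_uncorrected": uncorrected,
        })
    return rows


def find_problems(rows, max_temp_c=80):
    """Return one warning per hot GPU and per nonzero ECC counter.

    max_temp_c is a hypothetical threshold chosen for illustration.
    """
    problems = []
    for row in rows:
        if row["temp_c"] is not None and row["temp_c"] > max_temp_c:
            problems.append("GPU %s: %s C exceeds %s C"
                            % (row["index"], row["temp_c"], max_temp_c))
        for key in ("ecc_corrected", "ecc_uncorrected"):
            if row[key]:  # None (unsupported) and 0 both count as fine
                problems.append("GPU %s: %s = %s"
                                % (row["index"], key, row[key]))
    return problems
```

Calling find_problems(parse_rows(query_gpus())) from cron and appending the result to a log with a timestamp would give you a history per node. The volatile ECC counters reset when the driver reloads, so the aggregate counters may be the better choice for long-term tracking.
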
In the past we had major issues with the node heat sensors being designed around detecting CPU heat and not that of the GPUs living in the same box. A firmware upgrade fixed the issue, but the ECC checks were the thing that best found the problem nodes.

Cheers,

Paul

----- Original Message -----
From: "Michael Di Domenico" <mdidomeni...@gmail.com>
To: "Beowulf Mailing List" <Beowulf@beowulf.org>
Sent: Monday, 8 June, 2015 7:50:40 AM
Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system cooling

nvidia-smi will also show the current temperature of the card. You could script it to save the results over time. It even includes XML output if you're savvy at parsing it.

On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajde...@ajdecon.org> wrote:
> Hi Kevin,
>
> nvidia-healthmon is the tool I've used for this kind of thing in the past.
> It can do temperature checks as well as some sanity checks for things like
> PCIe connectivity.
>
> http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
>
> For more general monitoring (i.e. compute and memory usage), I've used
> Ganglia with the NVML plugins. Not sure how well maintained these are
> though.
>
> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>
> Adam
>
>
> On Friday, June 5, 2015, Kevin Abbey <kevin.ab...@rutgers.edu> wrote:
>>
>> Hi,
>>
>> I recently installed an Nvidia K80 GPU in a server. Can anyone share
>> methods and procedures for monitoring and ensuring the card is cooled
>> sufficiently by the server fans? I need to set this up and test before
>> running any compute tests.
>>
>>
>> Thanks,
>> Kevin
>>
>> --
>> Kevin Abbey
>> Systems Administrator
>> Center for Computational and Integrative Biology (CCIB)
>> http://ccib.camden.rutgers.edu/
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf

--
Dr Paul McIntosh
Senior HPC Consultant, Technical Lead,
Multi-modal Australian ScienceS Imaging and Visualisation Environment (www.massive.org.au)
Monash University, Ph: 9902 0439 Mob: 0434 524935