Re: [Beowulf] gpu+server health monitoring -- ensure system cooling

Kevin Abbey Sun, 07 Jun 2015 21:16:35 -0700

Thank you all for the notes. The current host BIOS/BMC appears to read
data from a MIC card but not the Nvidia. I'm considering finding a
method to simply force an increased fan speed in the server for jobs
using the gpu. I'll also ask Intel again if they can help, perhaps with
a custom SDR file. I assume they have done this on their current
generation of hardware, which would hopefully be portable to a Sandy
Bridge board.


Are there published average running temperatures of gpu: k20, k40, k80?

nvidia-smi reported 66C during a few test jobs. This is below the power
throttle temperature on the gpu, but the utilization was still below 75%.


Thanks, I'll check for the ECC errors too.
Kevin


On 6/7/2015 9:14 PM, Paul McIntosh wrote:

we use nvidia-smi also

You should also keep an eye out for GPU ECC errors as we have found these are 
good predictors of bad things happening due to heat. Generally you should see 
none.

In the past we had major issues with the node heat sensors being designed 
around detecting CPU heat and not the GPUs living in the same box. A firmware 
upgrade fixed the issue but the ECC checks were the thing that best found the 
problem nodes.
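
As a rough illustration of the ECC check described above, here is a
minimal Python sketch. It assumes the installed nvidia-smi accepts the
--query-gpu fields ecc.errors.corrected.volatile.total and
ecc.errors.uncorrected.volatile.total; the function names and the idea of
flagging any nonzero count are assumptions for illustration, not something
taken from this thread.

```python
import subprocess

# Assumed query; field names should be checked against
# `nvidia-smi --help-query-gpu` for the installed driver.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.corrected.volatile.total,"
    "ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]


def parse_ecc_counts(csv_text):
    """Map GPU index -> (corrected, uncorrected); None where not reported."""
    counts = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not index + two counters
        index, corrected, uncorrected = fields
        counts[index] = tuple(
            int(v) if v.isdigit() else None for v in (corrected, uncorrected)
        )
    return counts


def gpus_with_ecc_errors(counts):
    """Return the indices of GPUs reporting any nonzero ECC count."""
    return sorted(i for i, pair in counts.items() if any(v for v in pair))


def check_node():
    """Hypothetical helper: run the query on this node and list suspect GPUs."""
    out = subprocess.check_output(ECC_QUERY, text=True)
    return gpus_with_ecc_errors(parse_ecc_counts(out))
```

Calling check_node() from a node health check or a scheduler prolog would
return an empty list on a healthy node. GPUs with ECC disabled report a
non-numeric value, which is recorded as None and not flagged.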

Cheers,

Paul


----- Original Message -----
From: "Michael Di Domenico" <mdidomeni...@gmail.com>
To: "Beowulf Mailing List" <Beowulf@beowulf.org>
Sent: Monday, 8 June, 2015 7:50:40 AM
Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system cooling

nvidia-smi will also show the current temperature of the card.  you
could script it to save the results over time.  it even includes xml
output if you're savvy at parsing it
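
A minimal Python sketch of the scripting idea above, assuming the XML
from "nvidia-smi -q -x" has one <gpu> element per device with the reading
under temperature/gpu_temp (e.g. "66 C"). The function names, log format,
and sampling defaults are hypothetical.

```python
import subprocess
import time
import xml.etree.ElementTree as ET


def parse_gpu_temps(xml_text):
    """Return {gpu id: temperature in C} from `nvidia-smi -q -x` output."""
    temps = {}
    root = ET.fromstring(xml_text)
    for gpu in root.findall("gpu"):
        reading = gpu.findtext("temperature/gpu_temp") or ""
        parts = reading.split()  # "66 C" -> ["66", "C"]
        if not parts or not parts[0].isdigit():
            continue  # sensor missing or reported as "N/A"
        temps[gpu.get("id")] = int(parts[0])
    return temps


def log_gpu_temps(logfile, interval_s=60, samples=10):
    """Append timestamped readings to logfile, one line per GPU per sample."""
    with open(logfile, "a") as out:
        for _ in range(samples):
            xml_text = subprocess.check_output(
                ["nvidia-smi", "-q", "-x"], text=True
            )
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            for gpu_id, temp in sorted(parse_gpu_temps(xml_text).items()):
                out.write(f"{stamp} {gpu_id} {temp}\n")
            out.flush()
            time.sleep(interval_s)
```

Running log_gpu_temps("gpu_temps.log") alongside a test job would give a
simple temperature trace to compare against the fan settings.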

On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajde...@ajdecon.org> wrote:

Hi Kevin,

nvidia-healthmon is the tool I've used for this kind of thing in the past.
It can do temperature checks as well as some sanity checks for things like
PCIe connectivity.

http://docs.nvidia.com/deploy/healthmon-user-guide/index.html

For more general monitoring (i.e. compute and memory usage), I've used
Ganglia with the NVML plugins. Not sure how well maintained these are
though.

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia

Adam


On Friday, June 5, 2015, Kevin Abbey <kevin.ab...@rutgers.edu> wrote:

Hi,

I recently installed a Nvidia K80 gpu in a server. Can anyone share
methods and procedures for monitoring and ensuring the card is cooled
sufficiently by the server fans?  I need to set this up and test before
running any compute tests.


Thanks,
Kevin

--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf




--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.ab...@rutgers.edu
