I can confirm that Ganglia supports GPU monitoring on the Tesla K80 just fine.
Regarding GPU temperatures, I'm seeing ~60C in one of NVIDIA's
officially certified servers for the Tesla K80 (a 4U Supermicro
SYS-7048GR-TR). You might not want to use the Tesla K20/K40 as comparisons,
because they had lower levels of GPU Boost (and thus might not push the
TDP envelope as much).
Best,
Eliot
On 06/08/2015 12:07 AM, Kevin Abbey wrote:
Thank you each for the notes. The current host BIOS/BMC appears to
read data from a MIC card but not from the NVIDIA card. I'm considering
finding a method to simply force an increased fan speed in the server
for jobs using the GPU. I'll also ask Intel again if they can help,
perhaps with a custom SDR file. I assume they have done this on their
current generation of hardware, which would hopefully be portable to a
Sandy Bridge board.
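
As a stopgap, I'm picturing something like the rough sketch below:
poll the GPU temperature with nvidia-smi and kick the fans up over
IPMI when it gets hot. The ipmitool raw bytes shown are placeholders
only, since the actual fan-control command is board/BMC-specific; I'd
need the real sequence from the board vendor.

#!/usr/bin/env python
# Rough sketch: poll GPU temperature and force a higher fan speed
# when it crosses a threshold. The ipmitool raw bytes below are
# PLACEHOLDERS -- the real command is board/BMC-specific.
import subprocess
import time

TEMP_THRESHOLD_C = 70  # arbitrary threshold for this sketch

def gpu_temps():
    # Returns a list of per-GPU temperatures in C via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader"])
    return [int(line) for line in out.decode().split()]

def set_fans_full():
    # PLACEHOLDER bytes: replace with your BMC's actual fan-control
    # command as documented by the board vendor.
    subprocess.call(["ipmitool", "raw", "0x30", "0x45", "0x01", "0x01"])

while True:
    if any(t >= TEMP_THRESHOLD_C for t in gpu_temps()):
        set_fans_full()
    time.sleep(30)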
Are there published average running temperatures for the K20, K40,
and K80 GPUs? nvidia-smi reported 66C during a few test jobs. This is
below the power-throttle temperature of the GPU, but utilization was
still below 75%.
Thanks, I'll check for the ECC errors too.
Kevin
On 6/7/2015 9:14 PM, Paul McIntosh wrote:
We use nvidia-smi also.
You should also keep an eye out for GPU ECC errors as we have found
these are good predictors of bad things happening due to heat.
Generally you should see none.
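
Checking is easy to script. Here's a rough sketch using nvidia-smi's
CSV query output; the ECC field names below are the aggregate
counters I believe current drivers expose, but verify them against
"nvidia-smi --help-query-gpu" on your system (they read [N/A] if ECC
is disabled):

#!/usr/bin/env python
# Sketch of an ECC sanity check using nvidia-smi's CSV query output.
# Field names are assumed; adjust to whatever your nvidia-smi
# version lists under "nvidia-smi --help-query-gpu".
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,ecc.errors.corrected.aggregate.total,"
     "ecc.errors.uncorrected.aggregate.total",
     "--format=csv,noheader"])

for line in out.decode().strip().splitlines():
    index, corrected, uncorrected = [f.strip() for f in line.split(",")]
    # Any non-zero count is worth investigating on a healthy node.
    if corrected != "0" or uncorrected != "0":
        print("GPU %s: corrected=%s uncorrected=%s -- check this node!"
              % (index, corrected, uncorrected))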
In the past we had major issues with the node heat sensors being
designed around detecting CPU heat and not the GPUs living in the
same box. A firmware upgrade fixed the issue, but the ECC checks were
the thing that best found the problem nodes.
Cheers,
Paul
----- Original Message -----
From: "Michael Di Domenico" <mdidomeni...@gmail.com>
To: "Beowulf Mailing List" <Beowulf@beowulf.org>
Sent: Monday, 8 June, 2015 7:50:40 AM
Subject: Re: [Beowulf] gpu+server health monitoring -- ensure system
cooling
nvidia-smi will also show the current temperature of the card. You
could script it to save the results over time; it even includes XML
output if you're savvy at parsing it.
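
E.g. a minimal sketch along these lines, assuming the element names
from "nvidia-smi -q -x" output (verify against your driver version):

#!/usr/bin/env python
# Sketch: log per-GPU temperature over time by parsing nvidia-smi's
# XML output. Element names assumed from "nvidia-smi -q -x".
import subprocess
import time
import xml.etree.ElementTree as ET

while True:
    xml_out = subprocess.check_output(["nvidia-smi", "-q", "-x"])
    root = ET.fromstring(xml_out)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for gpu in root.findall("gpu"):
        temp = gpu.find("temperature/gpu_temp").text  # e.g. "66 C"
        print("%s %s %s" % (stamp, gpu.get("id"), temp))
    time.sleep(60)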
On Sat, Jun 6, 2015 at 10:09 AM, Adam DeConinck <ajde...@ajdecon.org>
wrote:
Hi Kevin,
nvidia-healthmon is the tool I've used for this kind of thing in the
past. It can do temperature checks as well as some sanity checks for
things like PCIe connectivity.
http://docs.nvidia.com/deploy/healthmon-user-guide/index.html
For more general monitoring (i.e., compute and memory usage), I've used
Ganglia with the NVML plugins. Not sure how well maintained these are,
though.
https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
Adam
On Friday, June 5, 2015, Kevin Abbey <kevin.ab...@rutgers.edu> wrote:
Hi,
I recently installed an NVIDIA K80 GPU in a server. Can anyone share
methods and procedures for monitoring and ensuring the card is cooled
sufficiently by the server fans? I need to set this up and test it
before running any compute tests.
Thanks,
Kevin
--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
--
Eliot Eshelman
Microway, Inc.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf