Re: [Beowulf] gpu+server health monitoring -- ensure system cooling

2015-06-07 Thread Kevin Abbey
s://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia Adam On Friday, June 5, 2015, Kevin Abbey wrote: Hi, I recently installed an Nvidia K80 GPU in a server. Can anyone share methods and procedures for monitoring and ensuring the card is cooled sufficiently by the server fans? I

[Beowulf] gpu+server health monitoring -- ensure system cooling

2015-06-06 Thread Kevin Abbey
Hi, I recently installed an Nvidia K80 GPU in a server. Can anyone share methods and procedures for monitoring and ensuring the card is cooled sufficiently by the server fans? I need to set this up and test it before running any compute tests. Thanks, Kevin -- Kevin Abbey Systems

Re: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?

2009-10-23 Thread Kevin Abbey
I tried this on a Supermicro board and a Sun box. Both systems would reboot randomly, so I turned it off. This is a serious problem of false positives. In a cluster, you may need to notify the scheduler in some way when a node reboots. Can someone elaborate on this? Specifically

Re: [Beowulf] Nehalem and Shanghai code performance for our rzf example

2009-01-17 Thread Kevin Abbey
Hi Joe, Can that 9% difference be due to the Intel capability to overclock one core and turn the others off? Or does this Intel feature require a manual switch somewhere? Thank you, Kevin Joe Landman wrote: Hi folks: Thought you might like to see this. I rewrote the interior loop for o