Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Joshua mora acosta
You can run HPL bound to a specific socket, also maxing out the memory attached to that socket, to try to force a shutdown by tripping the "hardware thermal control" when cooling is insufficient. In the BIOS you can also enable HW monitoring to report fan speeds and perhaps detect the diff of
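As a rough illustration of binding a load to one socket, the sketch below builds a numactl command line that pins both CPU and memory to a single NUMA node. It assumes that each socket maps to one NUMA node, that numactl is installed, and that the HPL binary is called ./xhpl and can run as a single process; the function name hpl_socket_command is made up for this example.

```python
import shlex


def hpl_socket_command(node, hpl_binary="./xhpl"):
    """Build an argv list that runs HPL pinned to one NUMA node.

    Assumption: one socket == one NUMA node, and the binary is ./xhpl.
    """
    if node < 0:
        raise ValueError("NUMA node index must be non-negative")
    return [
        "numactl",
        f"--cpunodebind={node}",  # run only on CPUs of this node
        f"--membind={node}",      # allocate memory only from this node
        hpl_binary,
    ]


if __name__ == "__main__":
    # Print the command for each of two sockets (assumed dual-socket node).
    for node in (0, 1):
        print(shlex.join(hpl_socket_command(node)))
```

Running the load on one socket at a time, and noting which one brings the node down, narrows the search to that CPU, its heat sink, or its memory bank.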

Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Dmitry Zaletnev
Jon,
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All th

Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Victor Gregorio
Hello Jon,

If your system has temperature and fan sensors, you might be able to use lm_sensors to display component temperatures and diagnose fan failures.

[r...@tesla ~]# sensors-detect                 # answer all defaults
[r...@tesla ~]# /etc/init.d/lm_sensors start   # load kernel modules
[
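Once the sensor kernel modules are loaded, the readings are also exposed through the kernel's hwmon sysfs tree, so they can be logged from a script while the node is under load. The following is a minimal sketch, assuming a kernel that exposes temp*_input files (in millidegrees Celsius) directly under /sys/class/hwmon/hwmonN; older kernels may place them under a device/ subdirectory instead. The 70 C limit and the function names are arbitrary choices for illustration.

```python
import glob
import os


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {(chip_name, sensor_file): degrees_C} for each temp*_input file."""
    temps = {}
    for chip_dir in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        # The "name" file holds the driver/chip name; fall back to the dir name.
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)
        for path in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            try:
                with open(path) as f:
                    millideg = int(f.read().strip())
            except (OSError, ValueError):
                continue  # unreadable or malformed sensor, skip it
            temps[(chip, os.path.basename(path))] = millideg / 1000.0
    return temps


def hot_sensors(temps, limit_c=70.0):
    """Return only the readings at or above limit_c (an assumed threshold)."""
    return {key: deg for key, deg in temps.items() if deg >= limit_c}


if __name__ == "__main__":
    readings = read_hwmon_temps()
    hot = hot_sensors(readings)
    for (chip, sensor), deg in sorted(readings.items()):
        flag = "  <-- above limit" if (chip, sensor) in hot else ""
        print(f"{chip:12s} {sensor:14s} {deg:6.1f} C{flag}")
```

Running this on the suspect node and on a healthy neighbor in the same room, then comparing the two outputs, may show which sensor diverges before the crash.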

Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Billy Crook
On Tue, Jul 21, 2009 at 15:42, Bill Broadley wrote:
>
> I'd suggest doing a visual inspection.  Make sure all fans are not blocked by
> cables, are spinning.  If that looks normal pull the CPU heat sinks and make
> sure they have good coverage with the heat sink goo, but not so much that it
> leaks

[Beowulf] RE: Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread David Mathog
Jon Forrest wrote:
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works

Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Bill Broadley
I'd suggest doing a visual inspection. Make sure all fans are not blocked by cables, are spinning. If that looks normal pull the CPU heat sinks and make sure they have good coverage with the heat sink goo, but not so much that it leaks over the edge of the chip. When you put the heat sink back

Re: [Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Greg Lindahl
On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote:
> Other than opening the node
> to spray cooling liquid when it's in the warm
> room, what approach would you use to figure out which
> component(s) is(are) failing?

Swap them with good ones until it doesn't fail anymore?

-- g

[Beowulf] Approach For Diagnosing Heat Related Failure?

2009-07-21 Thread Jon Forrest
I have a rack full of identical compute nodes. One of them has become heat sensitive. When it's in the warm computer room it crashes. I can't even run memtest from the CentOS DVD for 2 seconds. However, when this node is in my much cooler office everything works fine. All the other nodes are work