You can run HPL bound to a specific socket, sized to also fill the memory
attached to that socket, to try to make the node shut itself down by tripping
the "hardware thermal control" when cooling is insufficient.
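
One way to do the socket binding is with numactl. The sketch below only builds and prints the command line. It assumes numactl is installed, that the HPL binary lives at ./xhpl (a hypothetical path; it depends on your build), and that HPL.dat has already been sized to fill that socket's memory. hpl_socket_cmd is a helper name made up for this example.

```shell
#!/bin/sh
# Sketch: print a numactl command that pins HPL to one socket (NUMA node)
# and restricts its memory allocations to that same node.
# Assumptions: numactl is installed; ./xhpl is a hypothetical HPL binary path.

hpl_socket_cmd() {
    node="$1"               # NUMA node (socket) to stress, e.g. 0 or 1
    hpl_bin="${2:-./xhpl}"  # HPL binary; the default path is an assumption
    echo "numactl --cpunodebind=$node --membind=$node $hpl_bin"
}

# Show the command for socket 1; run it by hand once it looks right.
hpl_socket_cmd 1
```

Running the printed command against each socket in turn should show whether only one of them trips the thermal limit.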
In the BIOS you can also use HW monitoring to report the fan speeds and perhaps
detect the diff of
Jon,
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All th
Hello Jon,
If your system has temperature and fan sensors, you might be able to use
lm_sensors to display component temperatures and diagnose fan failures.
[r...@tesla ~]# sensors-detect  # answer all defaults
[r...@tesla ~]# /etc/init.d/lm_sensors start  # load kernel modules
[
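
Once the modules are loaded, the readings can be filtered for anything running hot. This is a minimal sketch that assumes `sensors` prints temperature lines of the form `label: +NN.N°C ...`; the exact layout varies with the sensor chip and lm_sensors version. hot_sensors is a hypothetical helper name, and the 70 degree limit in the usage note is an arbitrary example value.

```shell
#!/bin/sh
# Sketch: read `sensors`-style output on stdin and print every temperature
# reading above a limit given in degrees C.
# Assumption: temperature lines look like "Core 0:   +45.0°C  (high = ...)".

hot_sensors() {
    limit="$1"
    awk -F: -v limit="$limit" '
        # Keep only readings that are a signed decimal followed by a C unit;
        # this skips fan speeds (RPM) and voltages (V).
        $2 ~ /^[ \t]*[+-][0-9]+\.[0-9]+[^ \t0-9]*C/ {
            split($2, f, " ")      # f[1] is the reading, e.g. "+45.0°C"
            sub(/^\+/, "", f[1])   # drop the leading plus sign
            temp = f[1] + 0        # numeric prefix of the reading
            if (temp > limit + 0) print $1 ": " temp
        }'
}
```

Typical use would be `sensors | hot_sensors 70`, run once in the warm room and once in the cool office to compare which sensors cross the limit.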
On Tue, Jul 21, 2009 at 15:42, Bill Broadley wrote:
>
> I'd suggest doing a visual inspection. Make sure none of the fans are blocked
> by cables and that all of them are spinning. If that looks normal, pull the CPU
> heat sinks and make sure they have good coverage with the heat sink goo, but
> not so much that it leaks
Jon Forrest wrote:
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
I'd suggest doing a visual inspection. Make sure none of the fans are blocked
by cables and that all of them are spinning. If that looks normal, pull the CPU
heat sinks and make sure they have good coverage with the heat sink goo, but
not so much that it leaks over the edge of the chip. When you put the heat sink back
On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote:
> Other than opening the node
> to spray cooling liquid when it's in the warm
> room, what approach would you use to figure out which
> component(s) is(are) failing?
Swap them with good ones until it doesn't fail anymore?
-- g
I have a rack full of identical compute
nodes. One of them has become heat sensitive.
When it's in the warm computer room it crashes.
I can't even run memtest from the CentOS DVD
for 2 seconds. However, when this node is
in my much cooler office everything works
fine. All the other nodes are work