Hello Jon,

If your system has temperature and fan sensors, you might be able to use lm_sensors to display component temperatures and diagnose fan failures.
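
Since the node only fails in the warm room, it may also help to log the readings to disk, so the last values before a crash survive the reboot. The following is only a rough sketch, assuming lm_sensors has already been set up with the commands given in this message. The function name log_sensors and the SENSORS_CMD variable are made up for illustration and are not part of lm_sensors.

```shell
#!/bin/sh
# Hypothetical helper (not part of lm_sensors): append timestamped
# sensor readings to a log file at a fixed interval.
# SENSORS_CMD is an illustrative override; it defaults to "sensors".
log_sensors() {
    logfile=$1      # file to append readings to
    interval=$2     # seconds to wait between samples
    count=$3        # number of samples to take
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            ${SENSORS_CMD:-sensors}
        } >> "$logfile"
        sync    # flush to disk so the latest sample survives a hard crash
        i=$((i + 1))
        # no need to wait after the final sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, "log_sensors /root/sensors.log 1 600" would take one sample per second for about ten minutes. After the node comes back up, the tail of the log shows which temperature or fan reading was moving just before the failure.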

[r...@tesla ~]# sensors-detect                 # answer all defaults
[r...@tesla ~]# /etc/init.d/lm_sensors start   # load kernel modules
[r...@tesla ~]# sensors                        # check sensor stats

Hope this helps.

Regards,
--
Victor Gregorio
Penguin Computing

On Tue, Jul 21, 2009 at 11:56:23AM -0700, Jon Forrest wrote:
> I have a rack full of identical compute
> nodes. One of them has become heat sensitive.
>
> When it's in the warm computer room it crashes.
> I can't even run memtest from the CentOS DVD
> for 2 seconds. However, when this node is
> in my much cooler office everything works
> fine. All the other nodes are working fine
> in the computer room.
>
> I'm not convinced the problem is actually
> the memory. Other than opening the node
> to spray cooling liquid when it's in the warm
> room, what approach would you use to figure out which
> component(s) is(are) failing?
>
> Cordially,
> --
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforr...@berkeley.edu
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf