Hi,

As I mentioned in my previous posting, the 20-node Tyan S2891 dual-Opteron, dual-core Debian cluster (1 head node providing NFS, 19 diskless compute nodes) is currently experiencing 2 intermittent problems which I'm trying to diagnose.

After a few days of testing and digging through system logs I'm pretty much stumped as to what may be causing these. There are 2 separate problems - anyone's opinions on how to go about diagnosing them, or on things I might have missed, would be most welcome.

Problem #1
Over the last 6 months, 3 different nodes have been found in a powered-down state - the nodes seem to have powered off during a run of the model. There are no interesting messages in the system logs coinciding with the time of these shutdowns. My first suspect was the power supply to the cluster, but the UPS has logged no errors coinciding with these failures. I've run a bunch of stress testers on the systems that failed, including cpuburn and cpustress, in the hope that a failing component such as a PSU or processor would be triggered again - but all the systems happily ran 24 hours of tests without any problems. 2 of the 3 failing systems are logging some MCE messages - but they seem to be standard memory errors which are being corrected by the system. Any suggestions on where to go next?
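
To see whether the corrected MCE messages cluster on particular nodes or increase over time, something like the following Python sketch could tally them per host from collected syslog text. It is only a sketch under assumptions: the traditional "Mon DD HH:MM:SS host message" syslog layout and the marker strings in MCE_RE are guesses, not the actual log format on these nodes, and count_mce_by_host is a made-up helper name.

```python
import re
from collections import Counter

# Assumption: traditional syslog layout "Mon DD HH:MM:SS host message".
SYSLOG_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+(.*)$")

# Assumption: MCE-related lines mention "machine check" or "MCE".
# Adjust this to whatever the kernel or mcelog really prints.
MCE_RE = re.compile(r"machine check|\bMCE\b", re.IGNORECASE)


def count_mce_by_host(lines):
    """Count log lines that look MCE-related, keyed by the syslog host field."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match and MCE_RE.search(match.group(3)):
            counts[match.group(2)] += 1
    return counts
```

Feeding it the lines of each node's log (for example via open(path)) gives a per-host count that can be compared between the nodes that powered off and those that did not.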

Problem #2
On 2 occasions over the last 6 months one of the 2 oceanographic models we run on this cluster (ROMS, the other being SWAN) has gone into a state where it is running significantly slower than usual. This seems to have been preceded by us running the other model, but we can't reproducibly get the system into this state. Looking at various process stats, when the model is in the slowed-down state it goes from about 30% system CPU time and 60% user CPU time to about 60% system CPU time and 30% user CPU time. Again, there is nothing unusual in the logs, nor in the gigabit switch logs. A quick strace of one of the running model processes didn't show anything significantly unusual (although I don't normally sit there watching straces of the model during normal operation, so I could well have missed all sorts of things here). Again, any suggestions on where to go next on this would be welcome. I'm wondering if I'm seeing some strange kernel-level or MPI-level problem which only manifests under certain conditions, but I can't even guess at this stage what those conditions might be.
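
To catch the shift in the user/system split as it happens rather than after the fact, a small sampler along these lines could be left running on the compute nodes. This is a rough sketch, not what I currently use: it works from two readings of the aggregate "cpu" line in /proc/stat (fields in order: user, nice, system, idle, iowait, irq, softirq, steal), so it reports node-wide figures rather than per-process ones, and the function names are my own invention.

```python
def parse_cpu_line(line):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def cpu_split(before, after):
    """Return (user %, system %) of all CPU ticks elapsed between two samples."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    # Older kernels report fewer fields; pad so the indexing below is safe.
    delta += [0] * (8 - len(delta))
    # Use user..steal only: guest time is already included in user.
    total = sum(delta[:8])
    if total <= 0:
        raise ValueError("no CPU ticks elapsed between samples")
    user = delta[0] + delta[1]               # user + nice
    system = delta[2] + delta[5] + delta[6]  # system + irq + softirq
    return 100.0 * user / total, 100.0 * system / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()
```

Calling read_cpu_line() twice a few seconds apart and passing both results to cpu_split() gives the split for that interval; logging it with a timestamp would show when a run tips from mostly user time to mostly system time.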

Thanks,

-stephen

--
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland.  +353.91.751262  http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
