Re: [Beowulf] Node Drop-Off

Chris Samuel Mon, 13 Nov 2006 06:12:38 -0800

On Sunday 12 November 2006 16:13, Tim Moore wrote:

> Has anyone ever seen such behavior?


Others have mentioned attaching consoles, etc., but it's also worth 
trawling through the logs in /var/log to see whether anything is showing up 
there too.

Check dmesg whilst the node is under load: if you're seeing machine check 
problems, ECC parity problems or SCSI errors, you might catch them then 
(though they should also be in the logs).
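
As a rough sketch of that dmesg check, something like the Python below could 
flag suspicious lines. The function names and the patterns are my own 
assumptions for illustration; the exact wording of kernel messages varies by 
kernel version and driver, so treat the patterns as a starting point only.

```python
import re
import subprocess

# Illustrative patterns only (an assumption): real message wording
# differs between kernel versions and drivers.
PATTERNS = {
    "machine check": re.compile(r"machine check|\bmce\b", re.IGNORECASE),
    "ECC/parity": re.compile(r"\becc\b|\bedac\b|parity", re.IGNORECASE),
    "SCSI": re.compile(r"scsi.*error|i/o error", re.IGNORECASE),
}


def find_hardware_errors(lines):
    """Return (category, line) pairs for lines matching any pattern."""
    hits = []
    for line in lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((category, line.rstrip()))
                break  # report each line once, under the first match
    return hits


def scan_dmesg():
    """Run dmesg and scan its output (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return find_hardware_errors(result.stdout.splitlines())
```

Calling scan_dmesg() on the node while a job is running, or feeding the lines 
of a file from /var/log to find_hardware_errors(), would give a list of 
candidate lines to read by hand.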

If the node supports IPMI, try using that to get at any hardware logs, and if 
you use Ganglia to monitor the cluster, have a look at that and see whether 
anything there suggests that a user space program could be causing it.
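
For the IPMI part, one way to pull the hardware event log is ipmitool's 
"sel elist" subcommand. The wrapper below is only a sketch: the function 
names are made up, and the remote case assumes an IPMI 2.0 BMC reachable via 
the lanplus interface with the password supplied in the IPMI_PASSWORD 
environment variable (the -E option).

```python
import subprocess


def sel_command(host=None, user=None):
    """Build an ipmitool command line that lists the System Event Log."""
    cmd = ["ipmitool"]
    if host is not None:
        if user is None:
            raise ValueError("a user name is needed for remote access")
        # Remote BMC over the LAN; -E takes the password from IPMI_PASSWORD.
        cmd += ["-I", "lanplus", "-H", host, "-U", user, "-E"]
    return cmd + ["sel", "elist"]


def read_sel(host=None, user=None):
    """Run ipmitool and return the event log as a list of lines."""
    result = subprocess.run(
        sel_command(host, user), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

With no arguments, read_sel() talks to the local BMC, which needs the IPMI 
kernel drivers loaded and usually root. Older IPMI 1.5 hardware would need 
"lan" rather than "lanplus".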

I know users shouldn't be able to crash nodes, but we have seen that on some 
kernels where the OOM killer is not very good at getting things right and the 
machine deadlocks when the user's program runs it out of RAM.

Another possibility is bad blocks in the swap partition which might only show 
up in low memory conditions (yes, using swap is bad, but people write bad 
code too) and corrupt something essential that's been paged out.
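
To see whether a node is actually getting into low memory conditions or 
touching swap before it drops off, a small check against /proc/meminfo may 
help. This is a minimal sketch; the function names and the 5% threshold are 
arbitrary assumptions, not recommendations.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_pressure(info, min_free_fraction=0.05):
    """Return warnings when free memory is low or swap is in use."""
    warnings = []
    total = info.get("MemTotal", 0)
    free = info.get("MemFree", 0)
    if total and free / total < min_free_fraction:
        warnings.append(f"MemFree is only {free} kB of {total} kB")
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    if swap_used > 0:
        warnings.append(f"{swap_used} kB of swap in use")
    return warnings


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live memory statistics on a Linux node."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Running memory_pressure(read_meminfo()) periodically and logging the result 
to another machine would show whether the node was swapping shortly before it 
went away. Note that MemFree excludes page cache, so a low value on its own 
is not proof of trouble.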

What does uname -a say on the box?

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
