On Sunday 12 November 2006 16:13, Tim Moore wrote:
> Has anyone ever seen such behavior?
Others have mentioned attaching consoles, etc., but it's also worth trawling through the logs in /var/log to see if anything is showing up there. Check dmesg whilst the node is under load; if you're getting machine check problems, ECC parity problems or SCSI errors then you might catch them there (though they should also be in the logs).

If the node supports IPMI, try to use that to get at any hardware logs. If you use Ganglia to monitor the cluster, have a look at that too and see if anything there suggests that a user space program could be causing it.

I know users shouldn't be able to crash nodes, but we have seen that on some kernels where the OOM killer is not very good at getting things right and the machine deadlocks when a user's program runs it out of RAM.

Another possibility is bad blocks in the swap partition, which might only show up in low memory conditions (yes, using swap is bad, but people write bad code too) and corrupt something essential that's been paged out.

What does uname -a say on the box?

cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
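
For anyone wanting to automate the log trawl suggested above, here is a rough sketch of a filter that flags lines hinting at machine checks, ECC trouble, SCSI or I/O errors, or OOM killer activity. The pattern list and the name scan_lines are assumptions made for illustration; kernel message wording differs between versions, so adjust the patterns to whatever your nodes actually log.

```python
import re

# Assumed patterns: kernel message wording varies between versions,
# so treat these as a starting point rather than a complete list.
SUSPECT_PATTERNS = {
    "machine check": re.compile(r"machine check", re.IGNORECASE),
    "ecc/edac": re.compile(r"\b(EDAC|ECC)\b", re.IGNORECASE),
    "scsi/io error": re.compile(r"scsi error|I/O error", re.IGNORECASE),
    "oom killer": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
}


def scan_lines(lines):
    """Return (category, line) pairs for lines matching a suspect pattern."""
    hits = []
    for line in lines:
        for category, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((category, line.rstrip("\n")))
                break  # report each line once, under the first matching category
    return hits
```

It takes any iterable of lines, so it can be fed an open log file from /var/log or the captured output of dmesg, and the resulting hits can then be printed or counted per node.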
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf