I have a compute node that has started dropping off. When I say drop off, I mean the node (while running a job) will lose all connectivity and the machine does not respond. I have viewed the logs and can find no reason for the node to cease functioning.
If you connect a console to such a node, has it simply panicked?
Has anyone ever seen such behavior?
I have the occasional node that turns itself off under load. The IPMI reports the power as off, so it's distinct from a panic. The IPMI system event log doesn't show any reason. We (and the vendor) regard this as grounds for repair (usually the power supply).

regards, mark hahn.
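For anyone wanting to script the distinction drawn above (powered off versus panicked or hung), here is a minimal sketch. It assumes `ipmitool` is installed and that the node's BMC is reachable over the network. The function names, the BMC host name, and the credentials are hypothetical placeholders, and the parser assumes the usual "Chassis Power is on/off" wording of `ipmitool chassis power status`; check it against what your BMC actually prints.

```python
import subprocess


def parse_power_state(output: str) -> str:
    """Map `ipmitool chassis power status` text to 'on', 'off', or 'unknown'.

    Assumes output of the form "Chassis Power is off".
    """
    text = output.strip().lower()
    if text.endswith("is off"):
        return "off"
    if text.endswith("is on"):
        return "on"
    return "unknown"


def classify(power_state: str, answers_ping: bool) -> str:
    """Rough triage following the reasoning in this thread."""
    if power_state == "off":
        # Node switched itself off: not a panic, suspect hardware.
        return "powered off: suspect hardware (often the power supply)"
    if power_state == "on" and not answers_ping:
        # Still powered but silent: look at the console for a panic or hang.
        return "powered on but unresponsive: check the console for a panic"
    if power_state == "on":
        return "node looks alive"
    return "BMC state unknown"


def query_power_state(bmc_host: str, user: str, password: str) -> str:
    """Ask the BMC for the chassis power state (hypothetical host and credentials)."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", "status",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return parse_power_state(result.stdout)
```

On a real cluster you would call something like `classify(query_power_state("node42-bmc", "admin", "secret"), answers_ping=False)`, with your own BMC name and credentials. Running `ipmitool sel elist` against the same BMC dumps the system event log, which is worth checking even though, as noted above, it may show nothing.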