I have a compute node that has started dropping off. When I say drop off, I mean the node (while running a job) will lose all connectivity and the machine does not respond. I have viewed the logs and can find no reason for the node to cease functioning.

if you connect a console to such a node, has it simply panicked?

Has anyone ever seen such behavior?

I have the occasional node which turns itself off under load.
the IPMI reports power being off, so it's distinct from panics.
the IPMI system-error-log doesn't show any reason.
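
For anyone who wants to run the same check from a head node, here is a minimal sketch in Python that wraps the standard ipmitool commands `chassis power status` and `sel list`. The function names and the BMC host and user arguments are hypothetical placeholders, not anything from the original report. It assumes the BMC is reachable over the lanplus interface and that the password is supplied through the IPMI_PASSWORD environment variable (ipmitool's -E option).

```python
import subprocess


def parse_power_status(output: str) -> str:
    """Map the text printed by `ipmitool chassis power status` to 'on', 'off' or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def bmc_command(host: str, user: str, *args: str) -> str:
    """Run one ipmitool command against a node's BMC and return its stdout.

    Assumption: lanplus is enabled on the BMC and IPMI_PASSWORD is set in the
    environment (read by ipmitool because of -E).
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.stdout


def check_node(host: str, user: str) -> dict:
    """Collect the power state and the raw system event log for one node."""
    state = parse_power_status(bmc_command(host, user, "chassis", "power", "status"))
    sel = bmc_command(host, user, "sel", "list")
    return {
        "power": state,
        # An 'off' state points away from a kernel panic, since a panicked
        # node normally stays powered.
        "powered_off": state == "off",
        "sel_lines": [line for line in sel.splitlines() if line.strip()],
    }
```

Calling check_node("node17-bmc", "admin") (both names made up) would return the power state plus whatever the event log contains. If the power is off and the log is empty, that matches the symptom described above.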

we (and the vendor) regard this as grounds for repair (usually
the power supply).

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
