Chris Samuel wrote:
Very occasionally we find one of our Barcelona nodes with
a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
it as powered down too.

Hi Chris,

We had a similar experience with one of our compute nodes - intermittent power-offs when running our model and absolutely nothing in the logs. I modified Ganglia to track voltage and temperature in an effort to see if anything unusual happened to those beforehand, but there were no discernible trends.
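
For anyone wanting to try the same kind of tracking, here is a minimal sketch. It is not the modification I actually made; it assumes the readings come from `ipmitool sensor` (pipe-delimited output) and are pushed to Ganglia with `gmetric`. The sample sensor names and values, the `ipmi_` metric prefix and both function names are made up for illustration.

```python
def parse_ipmitool_sensors(text):
    """Parse pipe-delimited `ipmitool sensor`-style output.

    Returns {sensor name: (numeric value, unit)}; sensors without a
    numeric reading (e.g. "na" or discrete hex states) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue  # not a sensor line
        name, raw, unit = fields[0], fields[1], fields[2]
        try:
            value = float(raw)
        except ValueError:
            continue  # no numeric reading available
        readings[name] = (value, unit)
    return readings


def gmetric_commands(readings):
    """Build one gmetric argument list per reading (nothing is executed here)."""
    commands = []
    for name, (value, unit) in sorted(readings.items()):
        metric = "ipmi_" + name.lower().replace(" ", "_")  # hypothetical naming scheme
        commands.append([
            "gmetric",
            "--name", metric,
            "--value", str(value),
            "--type", "float",
            "--units", unit,
        ])
    return commands


# Hypothetical sample output; real sensor names differ per motherboard.
SAMPLE = """\
CPU1 Temp        | 48.000     | degrees C  | ok    | na | na | na | 78.000 | 80.000 | na
CPU1 Vcore       | 1.208      | Volts      | ok    | na | na | na | na | na | na
Fan1             | na         | RPM        | na    | na | na | na | na | na | na
"""

if __name__ == "__main__":
    for cmd in gmetric_commands(parse_ipmitool_sensors(SAMPLE)):
        print(" ".join(cmd))
```

On a real node you would feed it the output of `ipmitool sensor` from cron and run each command with `subprocess.run`, then look at the Ganglia graphs leading up to a power-off.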

I ran memtest86+ a number of times on the problem node and neither it nor mcelog showed any problems.

Subsequent to that, I found a BIOS upgrade for those systems which included an Opteron microcode update to fix an AMD processor erratum - I can dig out the details if the specific problem is of interest.

Around the same time, we finally started to see memory errors, so we also replaced the bad memory in the system.

Unfortunately I can't tell you which was responsible for fixing the problem. My understanding is that Fluent is quite memory and I/O intensive - do you run other equally intensive models without seeing the failure?

Anyway, in summary - if you're totally stumped - try swapping out the memory and/or updating to the latest firmware and see if that improves the stability.

-stephen

--
Stephen Mulcahy       Applepie Solutions Ltd.      http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
