----- "Chris Samuel" <[EMAIL PROTECTED]> wrote:

> Does anyone have any bright ideas ?
Wow, thanks so much to everyone who responded on this, both to the list and in private, very much appreciated!

Given there were so many replies, I thought I'd try to comment on the main points that people raised rather than reply individually.

1) Power (lots of people)

The vendor swapped in a new PSU in one of these nodes this morning, so we are resuming attempts to reproduce this failure now.

The odd thing that we've noticed is that this often seems to happen when the node is only partly loaded (though not exclusively); for instance, at one point we saw a node fail with Fluent running on 4 cores and a home-grown code on another core (3 spare).

2) HT lockups (Scott and potentially Don)

We've seen the same "System Firmware Error" messages on some of our nodes, sometimes associated with a system lockup, so we're going to look into BIOS upgrades.

3) Fluent

Well, we had a node power off this morning that wasn't running Fluent, but instead had a 4-CPU Gaussian job, some NAMD processes from various jobs and some random user-compiled code.

I don't know whether to be glad that Fluent isn't so special or worried that other code can kill nodes. :-/

4) IPMI (Bogdan)

We wondered if the IPMI/BMC module might have done the power off too, but we would hope that we would see something in the logs.

Anyway, we'll carry on with this using the hints and tips that people have provided, and when (if?) we solve this I'll certainly update the list with what we find!

Once again, thanks so much to all of you who took the time to reply.

All the best,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf