Hi. Dunno if this is a bright idea, but what about the power supply temperature? There are usually no measurements done in there, and a hot power supply could easily have a thermal fuse that gets tripped.
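If the BMC happens to expose a power supply or ambient sensor, one rough way to check would be to log every temperature sensor IPMI knows about, not just CPU and system. The sketch below is only an illustration: it assumes `ipmitool sdr type Temperature` is available on the node and prints its usual pipe-separated table. The function names and the 60 C limit are made up for the example, and many boards have no PSU temperature sensor at all.

```python
import subprocess


def parse_temperature_sensors(sdr_output):
    """Parse pipe-separated `ipmitool sdr type Temperature` output.

    Returns a dict mapping sensor name to degrees C. Sensors that report
    no numeric reading (e.g. "No Reading") are skipped.
    """
    readings = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row
        name, reading = fields[0], fields[4]
        parts = reading.split()
        # Numeric readings are assumed to look like "45 degrees C".
        if len(parts) >= 3 and parts[1] == "degrees" and parts[2] == "C":
            try:
                readings[name] = float(parts[0])
            except ValueError:
                continue
    return readings


def hot_sensors(readings, limit_c):
    """Return only the sensors at or above limit_c (hypothetical threshold)."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}


def read_sensors():
    """Run ipmitool locally; assumes it is installed and the BMC is reachable."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperature_sensors(result.stdout)


if __name__ == "__main__":
    # 60 C is an arbitrary example limit, not a vendor figure.
    for name, temp in sorted(hot_sensors(read_sensors(), 60.0).items()):
        print(f"{name}: {temp:.0f} C")
```

Run periodically while the Fluent job is going, this would at least show whether any sensor other than the CPU and system ones climbs before a node drops out.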
It may be worthwhile trying a different power supply, if possible one with a higher power rating.

Cheers,
-Alan

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Chris Samuel
Sent: Mon 08/12/2008 04:33
To: Beowulf List
Cc: David Bannon; Brett Pemberton
Subject: [Beowulf] Odd SuperMicro power off issues

Hi folks,

We've been tearing our hair out over this for a little while and so I'm wondering if anyone else has seen anything like this before, or has any thoughts about what could be happening?

Very occasionally we find one of our Barcelona nodes with a SuperMicro H8DM8-2 motherboard powered off. IPMI reports it as powered down too. No kernel panic, no crash, nothing in the system logs. Nothing in the IPMI logs either, it's just sitting there as if someone has yanked the power cable (and we're pretty sure that's not the cause!).

There had not been any discernible pattern to the nodes affected: we've only a couple of nodes where it's happened twice, the rest have only had it happen once, and they are scattered over the 3 racks of the cluster.

For the longest time we had no way to reproduce it, but then we noticed that for 3 of the power-offs there was a particular user running Fluent on there. They've provided us with a copy of their problem and we can (often) reproduce it now with that problem. Sometimes it'll take 30 minutes or so, sometimes it'll take 4-5 hours, sometimes it'll take 3 days or so and sometimes it won't do it at all.

It doesn't appear to be a thermal issue as (a) there's nothing in the IPMI logs about such problems and (b) we inject CPU and system temperature into Ganglia and we don't see anything out of the ordinary in those logs. :-(

We've tried other codes, including HPL, and Advanced Clustering's Breakin PXE version, but haven't managed to (yet) get one of the nodes to fail with anything except Fluent. :-(

The only oddity about Fluent is that it's the only code on the system that uses HP-MPI, but we used the command line switches to tell it to use the Intel MPI it ships with and it did the same then too!

I just cannot understand what is special about Fluent, or even how a user code could cause a node to just turn off without a trace in the logs.

Obviously we're pursuing this through the local vendor and (through them) SuperMicro, but to be honest we're all pretty stumped by this.

Does anyone have any bright ideas?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf