Hi Chris -

We've had similar problems on two different clusters using Barcelonas with two different motherboards.

Our new cluster uses SuperMicro TwinUs (two H8DMT-INF+ motherboards in each) and was delivered in early November. Out of roughly 590 motherboards, we had maybe 20 that powered down under load. As in your case, IPMI was still working, so we could power these back up remotely.
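For what it's worth, the remote power-up is just the standard ipmitool invocation; the BMC hostname and credentials below are placeholders, not our actual setup:

   # placeholder BMC host/credentials; "status" confirms the off state, "power on" brings it back
   ipmitool -I lanplus -H node0123-bmc -U ADMIN -P <password> chassis power status
   ipmitool -I lanplus -H node0123-bmc -U ADMIN -P <password> chassis power on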

For nearly all of these, swapping memory fixed the problem. For systems where multiple memory swaps did not fix the problem, the vendor swapped motherboards. I don't believe we've had to swap a power supply for this yet.

On an older, smaller cluster, which uses Asus KFSN4-DRE motherboards, the incidence rate has been much higher (20% or so), and swapping memory has not fixed the problem. On some of the systems, slowing the memory clock fixes it, but of course that reduces computational throughput. We are still working with the vendor to fix the problem nodes; for now, we are scheduling only 6 of the 8 available cores. For the job mix on that cluster, this has been a workable temporary solution for most of the power-off issues.
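In case it's useful, the 6-of-8 limit is done on the scheduler side; a sketch, assuming a Torque/PBS-style nodes file (the node names are placeholders, not our actual hosts):

   # server_priv/nodes: advertise only 6 of the 8 cores on the affected nodes
   node001 np=6
   node002 np=6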

As in your case, many of the codes our users run do not cause a problem. On the Asus-based cluster, a computational cosmology code will trigger the power shutdowns. The best torture code we've found has been xhpl (Linpack) built against a threaded version of libgoto; when this is executed on a single dual-Barcelona node with "-np 8", each of the 8 MPI processes spawns 8 threads. This particular binary will cause our bad nodes to power off very quickly (you are welcome to a copy of the binary; just let me know).
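Roughly, the run looks like this; a sketch only, since the exact way the thread count reaches the ranks depends on the MPI stack, and the binary/HPL.dat paths are placeholders:

   # 8 MPI ranks, each spawning 8 libgoto threads (GOTO_NUM_THREADS), on one 8-core node
   export GOTO_NUM_THREADS=8
   mpirun -np 8 ./xhpl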

The power draw from our Barcelona systems is very strongly dependent on the code: the difference between the xhpl binary mentioned above and the typical Lattice QCD codes we run is at least 25%. Because of this we've always suspected thermal or power issues, but the vendor of our Asus-based cluster has done the obvious things to check both (e.g., using active coolers on the CPUs, using larger power supplies, and so forth) and hasn't had any luck. Also, the fact that swapping memory on our SuperMicro systems helps, without affecting computational performance, probably means that it is not a thermal issue on the CPUs.

Don Holmgren
Fermilab




On Mon, 8 Dec 2008, Chris Samuel wrote:

Hi folks,

We've been tearing our hair out over this for a little
while and so I'm wondering if anyone else has seen anything
like this before, or has any thoughts about what could be
happening?

Very occasionally we find one of our Barcelona nodes with
a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports
it as powered down too.

No kernel panic, no crash, nothing in the system logs.

Nothing in the IPMI logs either; it's just sitting there as if someone has yanked the power cable (and we're pretty sure that's not the cause!).
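(For concreteness, the checks were just the stock ipmitool ones, along these lines; the BMC hostname and credentials are placeholders:)

   ipmitool -I lanplus -H nodeNN-bmc -U <user> -P <pass> chassis power status
   ipmitool -I lanplus -H nodeNN-bmc -U <user> -P <pass> sel elist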

There hasn't been any discernible pattern to the nodes affected: we've only a couple of nodes where it's happened twice, and the rest have had it happen just once, scattered over the 3 racks of the cluster.

For the longest time we had no way to reproduce it, but then we noticed that for 3 of the power-offs there was a particular user running Fluent on the node. They've provided us with a copy of their problem case, and we can (often) reproduce it now with that. Sometimes it'll take 30 minutes or so, sometimes 4-5 hours, sometimes 3 days or so, and sometimes it won't do it at all.

It doesn't appear to be a thermal issue, as (a) there's nothing in the IPMI logs about such problems and (b) we inject CPU and system temperature into Ganglia and don't see anything out of the ordinary in those logs. :-(
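(The injection is basically the usual per-sensor gmetric call, something like the following; the metric name and the literal value are placeholders for what actually comes back from the BMC or lm_sensors:)

   gmetric --name cpu0_temp --value 52 --type float --units Celsius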

We've tried other codes, including HPL and Advanced Clustering's Breakin PXE version, but haven't yet managed to get one of the nodes to fail with anything except Fluent. :-(

The only oddity about Fluent is that it's the only code on the system that uses HP-MPI, but we used the command-line switches to tell it to use the Intel MPI it ships with, and it did the same thing then too!

I just cannot understand what is special about Fluent,
or even how a user code could cause a node to just turn
off without a trace in the logs.

Obviously we're pursuing this through the local vendor
and (through them) SuperMicro, but to be honest we're
all pretty stumped by this.

Does anyone have any bright ideas?

cheers,
Chris