> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Greg Lindahl
> Sent: Thursday, August 23, 2007 12:14 PM
> To: beowulf@beowulf.org
> Subject: Re: [Beowulf] Intel Quad-Core or AMD Opteron
>
> On Thu, Aug 23, 2007 at 09:09:57AM -0400, Douglas Eadline wrote:
>
> > Naturally, if you have four processes running
> > it is best if each one gets its own woodcrest. To the OS
> > they all look the same. Other than Intel MPI, I don't
> > know of any other MPI that attempts to optimize this.
>
> InfiniPath MPI has always optimized this. Of course, there's no way it
> can read your mind and know if you are going to run a second 4-core
> job on the same node, so there is no perfect default. But it has
> switches to give you either behavior, tightly packed or spread out.
You mean like this (with Scali MPI Connect, verbose output):

1st job (4 processes on a dual-socket quad-core machine):

Affinity 'automatic' policy BANDWIDTH granularity CORE nprocs 4
Will bind process 0 with mask 1000000000000000 [(socket: 0 core: 0 execunit: 0)]
Will bind process 1 with mask 0000100000000000 [(socket: 1 core: 0 execunit: 0)]
Will bind process 2 with mask 0100000000000000 [(socket: 0 core: 1 execunit: 0)]
Will bind process 3 with mask 0000010000000000 [(socket: 1 core: 1 execunit: 0)]

2nd job on the same machine (while the first one is still running):

Affinity 'automatic' policy BANDWIDTH granularity CORE nprocs 4
Will bind process 0 with mask 0010000000000000 [(socket: 0 core: 2 execunit: 0)]
Will bind process 1 with mask 0000001000000000 [(socket: 1 core: 2 execunit: 0)]
Will bind process 2 with mask 0001000000000000 [(socket: 0 core: 3 execunit: 0)]
Will bind process 3 with mask 0000000100000000 [(socket: 1 core: 3 execunit: 0)]

In Scali MPI Connect, a "bandwidth" policy means that you want the processes to use as many sockets as possible (i.e. spread out) to optimize for memory bandwidth. There is also a "latency" policy which uses as few sockets as possible (to optimize for shared cache usage). These automatic policies take into account processes already running on the node that have their affinity set (MPI jobs or not). The policy type (bandwidth vs. latency) is of course user controllable (who are we to say how your application performs best? :) )

Anyway, enough "marketing" for now, I guess. The bottom line is that having the right kind of processor affinity control mechanism has proven to be key to getting good application performance (but y'all knew that already, I guess...). One thing we've seen is that on these quad-core/dual-socket machines, total job throughput is somewhat higher if you run twice as many jobs with half the cores per job (i.e. like my example above). I don't yet have a good explanation for this, but you'd probably like to discuss it :)

Cheers,
Steffen Persvold
Chief Software Architect - Scali MPI Connect
http://www.scali.com/ - Higher Performance Computing
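
PS: For anyone who wants to experiment with the mechanism by hand, here is a minimal sketch (my own illustration, not Scali code) of how a process on Linux could pick a core for its local rank under a spread-out ("bandwidth") or packed ("latency") policy and then bind to it with sched_setaffinity(). The socket/core counts, the socket-major core numbering, and passing the rank/policy on the command line are assumptions made for the example, and unlike the automatic policy above it does not look at what other jobs on the node have already bound to.

/* bindcore.c -- hypothetical example, not Scali's implementation.
 * Maps a local rank onto a core of a dual-socket, quad-core node and
 * binds the calling process to it.  "Bandwidth" (spread) alternates
 * sockets, "latency" (packed) fills one socket before using the next.
 * Assumes cores are numbered socket-major: core = socket*4 + core_in_socket.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define NSOCKETS         2
#define CORES_PER_SOCKET 4

/* Core chosen for 'rank' under the given policy. */
static int core_for_rank(int rank, int spread_out)
{
    if (spread_out)   /* "bandwidth": socket 0, 1, 0, 1, ... */
        return (rank % NSOCKETS) * CORES_PER_SOCKET + rank / NSOCKETS;
    else              /* "latency": fill socket 0 first */
        return rank;
}

int main(int argc, char **argv)
{
    /* In real life the local rank and policy come from the MPI launcher;
     * here they are plain command-line arguments. */
    int rank   = (argc > 1) ? atoi(argv[1]) : 0;
    int spread = (argc > 2) ? atoi(argv[2]) : 1;
    int core   = core_for_rank(rank, spread);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("rank %d bound to core %d (socket %d, core %d within socket)\n",
           rank, core, core / CORES_PER_SOCKET, core % CORES_PER_SOCKET);
    return 0;
}

Compile with something like gcc -o bindcore bindcore.c and run it once per local rank (./bindcore <rank> <1|0>) to see the mapping; with spread=1 ranks 0-3 land on socket 0/1 alternately, matching the pattern in the first job's output above. A real MPI library would of course do the equivalent internally at startup, using the local rank it gets from its launcher.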