It seems there are at least 3 dimensions for expansion.  What (in your
opinion) is the right tradeoff between more cores, more processors and
more individual compute nodes?

I'd claim this is not a matter of opinion, but rather a matter of which
things matter most to you: memory bandwidth or capacity, density,
interconnect bandwidth, perhaps even disk IO bandwidth.

In particular, I am thinking of in-house parallel finite difference /
finite element codes, parallel BLAS, and maybe some commercial Monte-Carlo
codes (the last being an embarrassingly parallel problem).

Monte Carlo, from what I see, is both emb-par and tiny, so it really just
wants lots of cores, little memory, a light interconnect, etc.
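
(a quick sketch of why, assuming the usual MC structure - each rank runs
its trials completely independently, and the only traffic is a single
reduce at the end.  the pi estimate and all names here are purely
illustrative:)

/* minimal sketch of an embarrassingly-parallel Monte Carlo (pi estimate):
 * the only communication is one MPI_Reduce at the very end. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long trials = 1000000;          /* per rank */
    long hits = 0;
    unsigned int seed = 1234u + rank;     /* independent stream per rank */

    for (long i = 0; i < trials; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }

    long total = 0;
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total / (trials * (double)nranks));

    MPI_Finalize();
    return 0;
}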

but that's an extreme; more generally the right choice depends on issues
like how cache-friendly the code is (a cache-friendly code is less sensitive
to the core-to-memory-bandwidth ratio), whether on-node shared memory is a
big win (still faster than the interconnect, and easier to program), whether
memory _capacity_ is more of an issue (which with AMD leads to more sockets
per node), etc.

it does seem like finite-element stuff tends to have a relatively high
work-to-surface-area ratio, so it is not terribly demanding of interconnect
(cheaper interconnect, and less harm from multiple cores per node).
similarly, the higher levels of BLAS are less demanding of memory bandwidth.
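
(rough numbers, just to illustrate: with a cubic subdomain of 100^3 points
per node, each step updates ~10^6 points but only exchanges ~6x100^2 =
60,000 halo values with its neighbours; grow the subdomain to 200^3 and the
work-to-communication ratio doubles again.  likewise, level-3 BLAS does
O(n^3) flops on O(n^2) data, so it can mostly run out of cache, whereas
level-1/2 BLAS streams through memory at O(1) flops per word.)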

I have been set the task of building our first cluster for these
applications.  Our existing in-house codes run on an SGI machine with a
parallelizing compiler.  They would need to be ported to use MPI on a
cluster.

would they? have you considered whether they'd run well on something like
an 8-socket, 16-core AMD system?  I'm guessing the SGI is an older
MIPS-based Origin, and thus has dramatically slower CPUs.

by "parallelizing compiler" do you mean OpenMP?
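
(to make the alternative concrete: if the existing code is loop-parallel in
the way a parallelizing compiler or OpenMP exploits - something like the
purely illustrative fragment below - it could keep running shared-memory on
a single fat node, with no MPI port at all.)

/* illustrative only: the kind of loop a parallelizing compiler (or an
 * explicit OpenMP directive) spreads across the cores of one
 * shared-memory node */
#include <omp.h>

void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}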

However, I do not understand what happens when you have
multi-processor/multi-core nodes in a cluster.  Do you just use MPI
(with each process using its own non-shared memory) or is there any way
to do "mixed-mode" programming which takes advantage of shared memory
within a node (like an MPI/OpenMP hybrid)?

sure, all the memory in a node is shared, so you can use threads or other
shared-memory techniques if you want.  but this takes a lot of additional
effort.  is it worth it?  bear in mind that any decent MPI will take some
advantage of faster access to a peer which happens to be on the same node.
and there are some packages (e.g. GotoBLAS) which can use threads internally,
and thus give you speedup even if you don't explicitly program with threads.
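
(here's a minimal sketch of what the hybrid looks like in practice - one MPI
rank per node, OpenMP threads filling the cores within it, built with
something like mpicc -fopenmp and launched with one rank per node.  the
per-thread work is just a stand-in; thread-support levels and pinning
details vary by MPI implementation and compiler.)

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;

    /* shared-memory parallelism inside the node */
    #pragma omp parallel reduction(+:local)
    {
        int tid = omp_get_thread_num();
        local += tid;            /* stand-in for each thread's real work */
    }

    /* distributed-memory reduction across the nodes */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g from %d ranks\n", global, nranks);

    MPI_Finalize();
    return 0;
}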

I don't see anyone bothering with this on our clusters - people who make
the jump to MPI tend not to care about small factors like 2 vs 4 cores/node,
since they're aiming at 3-digit core counts.  it's also easier to schedule
an n-way MPI job that has no requirements about the layout of its workers
than one which would require all the CPUs on all of its nodes.

for your transition, I would guess you need a combo cluster: some nice fat
nodes, as well as a decent-sized set of MPI-friendly ones.  you really need
to investigate your workload to figure out whether you can use gigabit
everywhere (surprisingly effective, even for serious MPI that's not emb-par)
or whether you need to step up to a real HPC interconnect (to me, that would
be either InfiniPath or Myrinet-10G).

regards, mark hahn.