For the foreseeable future I'm not developing much myself, but I will use the
hybrid SMP/DM capabilities in WRF. That mode takes advantage of shared memory
within each SMP node and uses message passing between nodes. I haven't used this
capability for benchmarking, but it appears to offer significant gains.
As we get more hybrid HPC hardware, planning for this will become more
important. A lot of system administrators (based on a statistical
sample of 4 local ones) have decreed that hybrid is inefficient and that one
should run either pure shared memory or pure distributed memory, so that we
don't make our Gaussian users feel unloved. I'm skeptical.
gerry
Joseph Mack NA3T wrote:
I've googled the internet and searched the Beowulf archives
for "hybrid" || "multicore", and the only definitive statement I've found
is from Greg Lindahl, 17 Dec 2004:
"Most of the folks interested in hybrid models a few years ago have now
given it up."
I assume this was from the era of 2-way SMP nodes.
Multicore CPUs are projected to keep scaling for the next 15 years (statement
by Pat Gelsinger, Intel's CTO, quoted in
http://cook.rfe.org/grid.pdf).
I expect the programming model will be a little different
for single-image machines like the Altix than for Beowulfs,
where each node has its own kernel (and which I assume will
be running dual quad-core mobos).
Still, if a flat, one-network model is used, all processes communicate
through the off-board networking. Someone with a quad-core machine,
running MPI on a flat network, told me that their application scales
poorly to 4 processors. If instead the processes on the cores within a
package worked on adjacent parts of the compute volume and communicated
through the on-board networking, then for a quad-core machine the
off-board networking bandwidth requirement would drop by roughly a factor
of 4, and scaling would improve.
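To make that bandwidth argument concrete, here is a toy count (my own arithmetic, not from any benchmark; `offboard_links` is a made-up helper) of how many neighbor-to-neighbor halo links cross a package boundary in a 1-D domain decomposition, comparing a flat layout against one that packs consecutive ranks onto the same quad-core package:

```python
def offboard_links(ranks, cores_per_package=1):
    """Count neighbor links that leave the package when consecutive
    ranks are packed onto the same package (1-D chain of subdomains)."""
    links = 0
    for r in range(ranks - 1):
        # The link between rank r and rank r+1 crosses the off-board
        # network only if the two ranks sit in different packages.
        if r // cores_per_package != (r + 1) // cores_per_package:
            links += 1
    return links

# Flat network: all 15 links between 16 ranks go off-board.
print(offboard_links(16, 1))
# Pack 4 adjacent ranks per quad-core package: only the 3
# package-boundary links still need the off-board network.
print(offboard_links(16, 4))
```

In one dimension the off-board traffic drops from 15 links to 3 for 16 ranks; in two or three dimensions the surface-to-volume effect gives a comparable reduction, consistent with the rough factor-of-4 estimate.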
In a quad-core machine, if 4 OpenMP threads are started on each
quad-core package, could they be rescheduled at the end of their
timeslice onto different cores, arriving at a cold cache? On a large
single-image machine, could a thread be scheduled on another node and
have to communicate over the off-board network? In a single-image
machine (with a single address space), how does the OS know to malloc
memory from the on-board memory, rather than from some arbitrary location
(on another board)?
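One answer to the rescheduling question is CPU affinity: pin each thread or process to a core so the scheduler can't bounce it onto a core with a cold cache. A minimal Linux-only sketch using Python's `os.sched_setaffinity` (OpenMP runtimes expose the same knob via environment variables such as GOMP_CPU_AFFINITY):

```python
import os

def pin_to_core(core):
    """Restrict the calling process to a single core, so the scheduler
    cannot migrate it to a different core with a cold cache."""
    os.sched_setaffinity(0, {core})   # pid 0 = the calling process
    return os.sched_getaffinity(0)    # report the new affinity mask
```

On the malloc question, Linux's default is a first-touch policy: a page lands on the memory of the node whose CPU first writes it, so pinning plus first-touch initialization keeps data on-board; numactl and libnuma give explicit control when that isn't enough.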
I expect everyone here knows all this. How is everyone going to program
the quad-core machines?
Thanks Joe
--
Gerry Creager -- [EMAIL PROTECTED]
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf