For the foreseeable future I'm not developing much, but I will use the hybrid SMP/DM capabilities in WRF. The hybrid mode takes advantage of SMP within each node and uses message passing between SMP nodes. I've not used this capability for benchmarking, but it appears to offer significant gains.

As we get more hybrid HPC capability, planning for this will become more important. A lot of system administrators (based on a statistical sample of 4 local ones) have decreed that hybrid is inefficient, and that one should do either pure shared memory or pure distributed memory so that we don't make our Gaussian users feel unloved. I'm skeptical.

gerry

Joseph Mack NA3T wrote:
I've searched the web and the Beowulf archives for "hybrid" || "multicore", and the only definitive statement I've found is by Greg Lindahl, 17 Dec 2004:

"Most of the folks interested in hybrid models a few years ago have now given it up".

I assume this was from the era of 2-way SMP nodes.

Multicore CPUs are being projected for 15 years into the future (statement by Pat Gelsinger, Intel's CTO, quoted in http://cook.rfe.org/grid.pdf).

I expect the programming model will be a little different for single-image machines like the Altix than for Beowulfs, where each node has its own kernel (and which I assume will be running dual quad-core mobos).

Still, if a flat, single-network model is used, all processes communicate through the off-board network. Someone with a quad-core machine, running MPI on a flat network, told me that their application scales poorly to 4 processors. If instead the processes on the cores within a package worked on adjacent parts of the compute volume and communicated through the on-board memory, then for a quad-core machine the off-board networking bandwidth requirement would drop by a factor of 4 and scaling would improve.
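For concreteness, here's a minimal sketch of that hybrid scheme, assuming MPI plus OpenMP; the slab size and the ring-style halo exchange are made up for illustration. One MPI rank per quad-core package does the off-board communication, while the four cores work on the rank's slab through shared on-board memory:

/* Hybrid MPI+OpenMP sketch: one MPI rank per quad-core package,
 * OpenMP threads sharing that rank's slab through on-board memory.
 * Compile (e.g.):  mpicc -fopenmp hybrid.c -o hybrid
 * Run (e.g.):      OMP_NUM_THREADS=4 mpirun -np <packages> ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1000000   /* made-up per-rank slab size */

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls, so the
     * off-board network sees one communicating process per package. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *slab = malloc(NLOCAL * sizeof *slab);

    /* All four cores work on adjacent parts of the rank's slab,
     * communicating through shared (on-board) memory. */
    #pragma omp parallel for
    for (int i = 0; i < NLOCAL; i++)
        slab[i] = (double)rank * NLOCAL + i;

    /* Only the slab's boundary crosses the off-board network,
     * once per package instead of once per core. */
    double halo = slab[NLOCAL - 1], recv = 0.0;
    MPI_Sendrecv(&halo, 1, MPI_DOUBLE, (rank + 1) % nranks, 0,
                 &recv, 1, MPI_DOUBLE, (rank + nranks - 1) % nranks, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("%d ranks x %d threads\n", nranks, omp_get_max_threads());

    free(slab);
    MPI_Finalize();
    return 0;
}

With 4 threads per rank, the number of processes on the off-board network drops by the factor of 4 described above, which is the whole point of the exercise.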

In a quad-core machine, if 4 OpenMP threads are started on each quad-core package, could they be rescheduled at the end of their timeslice onto different cores, arriving at a cold cache? On a large single-image machine, could a thread be scheduled onto another node and have to communicate over the off-board network? In a single-image machine (with a single address space), how does the OS know to malloc memory from the on-board memory, rather than from some arbitrary location (on another board)?
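On Linux, at least, the usual answers are explicit CPU affinity (so a thread can't be migrated to a cold cache) and the first-touch page placement policy (pages land on the memory of the node whose CPU first writes them). A hedged sketch, assuming Linux's sched_setaffinity and a thread-to-core numbering that matches the package layout; the array size is made up:

/* Sketch: thread pinning and first-touch NUMA placement on Linux.
 * Assumes Linux + glibc + OpenMP; compile e.g. gcc -fopenmp numa.c
 */
#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, CPU_ZERO, CPU_SET */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4000000       /* made-up array size */

int main(void)
{
    double *a = malloc(N * sizeof *a);   /* pages not yet placed */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        /* Pin each thread to one core so the scheduler can't move it
         * to a cold cache at the end of a timeslice. (Assumes thread
         * id == core id, which real codes map more carefully.) */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(tid, &mask);
        sched_setaffinity(0, sizeof mask, &mask);

        /* First touch: the kernel places each page on the node of the
         * CPU that first writes it, so each thread touching its own
         * chunk puts that chunk in its on-board memory. */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 0.0;
    }

    printf("initialized %d doubles with %d threads\n",
           N, omp_get_max_threads());
    free(a);
    return 0;
}

The same static schedule must then be used in the compute loops, so each thread keeps working on the pages it placed.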

I expect everyone here knows all this. How is everyone going to program the quad-core machines?

Thanks Joe

--
Gerry Creager -- [EMAIL PROTECTED]
Texas Mesonet -- AATLT, Texas A&M University        
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843