On Wed, 24 Jun 2009, Gus Correa wrote:

the "master" processor reads... broadcasts parameters that are used by all "slave" processors, and scatters any data that will be processed in a distributed fashion by each "slave" processor.
...
That always works, there is no file system contention.

I beg to disagree. There is no file system contention if this job is the only one doing I/O at that time, which could be the case if a job takes the whole cluster. However, in a more conventional setup with several jobs running simultaneously, I/O is done from several nodes at once (each running MPI rank 0 of its own job), which will still look like mostly random I/O to the storage.

> Another drawback is that you need to write more code for the I/O procedure.

I also disagree here. The I/O code only needs to run on MPI rank 0, so for the other ranks there is no need to think about race conditions, computing a rank-based position in the file, etc.
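
To make this concrete, here is a minimal sketch of that pattern in C with MPI; the input file name, the per-rank element count and the use of double precision data are only assumptions for the example, not anything from the original application:

/* "rank 0 does all I/O" sketch: the root reads the whole file and
 * scatters equal chunks to all ranks over the interconnect. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int chunk = 1024;            /* elements per rank (assumed) */
    double *full = NULL, *part;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    part = malloc(chunk * sizeof(double));

    if (rank == 0) {                   /* only rank 0 touches the file system */
        full = malloc((size_t)chunk * nprocs * sizeof(double));
        FILE *f = fopen("input.dat", "rb");    /* hypothetical input file */
        if (f == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fread(full, sizeof(double), (size_t)chunk * nprocs, f);
        fclose(f);
    }

    /* distribute the data through MPI instead of through NFS */
    MPI_Scatter(full, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* ... compute on 'part' ... */

    free(part);
    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}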

> In addition, MPI is in control of everything, you are less dependent on NFS quirks.

... or cluster design. I have seen several clusters which were designed with 2 networks, an HPC one (Myrinet or InfiniBand) and GigE, where the HPC network had full bisection bandwidth but the GigE was heavily over-subscribed, as the design really thought only about MPI performance and not about I/O performance. In such an environment, it's rather useless to try to do I/O simultaneously from several nodes which share the same uplink, independent of whether the storage is a single NFS server or a parallel FS. Doing I/O from only one node would allow full utilization of the bandwidth on the chain of uplinks to the file server, and the data could then be scattered/gathered quickly through the HPC network. Sure, a more hardware-aware application could have been more efficient (e.g. if it were possible to describe the network over-subscription so that as many uplinks as possible could be used simultaneously), but a more balanced cluster design would have been even better...

> [ parallel I/O programs ] always cause a problem when the number of processors is big.

I'd also like to disagree here. Parallel file systems teach us that a scalable system is one where the operations are split between several units that do the work. Applying the same knowledge to the generation of the data, a scalable application is one for which the I/O operations are split as much as possible between the ranks.

IMHO, the "problem" that you see is actually caused by reaching the limits of your cluster, IOW this is a local problem of that particular cluster and not a problem in the application. By re-writing the application to make it more NFS-friendly (f.e. like the above "rank 0 does all I/O"), you will most likely kill scalability for another HPC setup with a distributed/parallel storage setup.

> Often times these codes were developed on big iron machines, ignoring the hurdles one has to face on a Beowulf.

Well, the definition of Beowulf is quite fluid. Nowadays it is sufficiently easy to get a parallel FS running on commodity hardware that I wouldn't associate it with big iron anymore.

> In general they don't use MPI parallel I/O either

Being on the teaching side of a recent course and practical work involving parallel I/O, I've seen computer science and physics students make the transition from POSIX I/O on a shared file system to MPI-I/O quite easily. They sometimes get an index wrong, but mostly the conversion is painless. Since then, my impression is that it's mostly laziness and the attitude "POSIX is everywhere anyway, why should I bother with something that might be missing" that keeps applications at this stage.
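
For what it's worth, the extra work really is small; a minimal sketch of such an MPI-I/O write in C, where the output file name and chunk size are again only assumptions for the example, looks something like:

/* Every rank writes its own chunk at an offset derived from its rank. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    const int chunk = 1024;                    /* elements per rank (assumed) */
    double *part;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    part = malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++)            /* dummy data for the example */
        part[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* the only extra work compared to POSIX: a rank-based file position */
    offset = (MPI_Offset)rank * chunk * sizeof(double);

    /* collective write lets the MPI library aggregate requests for the
     * underlying (possibly parallel) file system */
    MPI_File_write_at_all(fh, offset, part, chunk, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(part);
    MPI_Finalize();
    return 0;
}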

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
