On Wed, 24 Jun 2009, Gus Correa wrote:

the "master" processor reads... broadcasts parameters that are used by all "slave" processors, and scatters any data that will be processed in a distributed fashion by each "slave" processor.
...
That always works, there is no file system contention.

I beg to disagree. There is no file system contention if this job is the only one doing I/O at that time, which could be the case if a job takes the whole cluster. However, in a more conventional setup with several jobs running simultaneously, I/O is done from several nodes at once (each running MPI rank 0 of its own job), which will still look like mostly random I/O to the storage.

> Another drawback is that you need to write more code for the I/O procedure.

I also disagree here. The I/O code only needs to run on MPI rank 0, so for the other ranks there is no need to think about race conditions, computing a rank-based position in the file, etc.
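
To make this concrete, here is a minimal sketch of that pattern in C with MPI; the input file name, the per-rank element count and the use of double precision data are only assumptions for the example, not anything from the original application:

/* "rank 0 does all I/O" sketch: the root reads the whole file and
 * scatters equal chunks to all ranks over the interconnect. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int chunk = 1024;            /* elements per rank (assumed) */
    double *full = NULL, *part;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    part = malloc(chunk * sizeof(double));

    if (rank == 0) {                   /* only rank 0 touches the file system */
        full = malloc((size_t)chunk * nprocs * sizeof(double));
        FILE *f = fopen("input.dat", "rb");    /* hypothetical input file */
        if (f == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fread(full, sizeof(double), (size_t)chunk * nprocs, f);
        fclose(f);
    }

    /* distribute the data through MPI instead of through NFS */
    MPI_Scatter(full, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* ... compute on 'part' ... */

    free(part);
    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}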

> In addition, MPI is in control of everything, you are less dependent on NFS quirks.

... or cluster design. I have seen several clusters which were designed with 2 networks, an HPC one (Myrinet or InfiniBand) and GigE, where the HPC network had full bisection bandwidth but the GigE was heavily over-subscribed, as the design really thought only about MPI performance and not about I/O performance. In such an environment, it's rather useless to try to do I/O simultaneously from several nodes which share the same uplink, independent of whether the storage is a single NFS server or a parallel FS. Doing I/O from only one node would allow full utilization of the bandwidth on the chain of uplinks to the file server, and the data could then be scattered/gathered quickly through the HPC network. Sure, a more hardware-aware application could have been more efficient (e.g. if it were possible to describe the network over-subscription so that as many uplinks as possible could be used simultaneously), but a more balanced cluster design would have been even better...

> [ parallel I/O programs ] always cause a problem when the number of processors is big.

I'd also like to disagree here. Parallel file systems teach us that a scalable system is one where the operations are split between several units that do the work. Applying the same knowledge to the generation of the data, a scalable application is one for which the I/O operations are split as much as possible between the ranks.

IMHO, the "problem" that you see is actually caused by reaching the limits of your cluster, IOW this is a local problem of that particular cluster and not a problem in the application. By re-writing the application to make it more NFS-friendly (f.e. like the above "rank 0 does all I/O"), you will most likely kill scalability for another HPC setup with a distributed/parallel storage setup.

> Often times these codes were developed on big iron machines, ignoring the hurdles one has to face on a Beowulf.

Well, the definition of Beowulf is quite fluid. Nowadays it is sufficiently easy to get a parallel FS running on commodity hardware that I wouldn't associate it with big iron anymore.

> In general they don't use MPI parallel I/O either

Being on the teaching side of a recent course and practical work involving parallel I/O, I've seen computer science and physics students make the transition from POSIX I/O on a shared file system to MPI-I/O quite easily. They sometimes get an index wrong, but mostly the conversion is painless. Since then, my impression is that it's mostly laziness and the attitude "POSIX is everywhere anyway, why should I bother with something that might be missing" that keeps applications at this stage.
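
For what it's worth, the extra work really is small; a minimal sketch of such an MPI-I/O write in C, where the output file name and chunk size are again only assumptions for the example, looks something like:

/* Every rank writes its own chunk at an offset derived from its rank. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    const int chunk = 1024;                    /* elements per rank (assumed) */
    double *part;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    part = malloc(chunk * sizeof(double));
    for (int i = 0; i < chunk; i++)            /* dummy data for the example */
        part[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* the only extra work compared to POSIX: a rank-based file position */
    offset = (MPI_Offset)rank * chunk * sizeof(double);

    /* collective write lets the MPI library aggregate requests for the
     * underlying (possibly parallel) file system */
    MPI_File_write_at_all(fh, offset, part, chunk, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(part);
    MPI_Finalize();
    return 0;
}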

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
