Hi Amjad, list
amjad ali wrote:
Hi,
Gus--thank you.
You are right. I mainly have to run programs on a small cluster (GigE)
dedicated to my job only; and sometimes I may get the opportunity to
run my code on a shared cluster with hundreds of nodes.
Thanks for telling me.
My guess was not very far from the truth. :)
BTW, if you are doing cluster control, I/O and MPI all across the same
GigE network, and if your nodes have dual GigE ports, you may consider
buying a second switch (or using VLAN on your existing switch,
if it has it) to deploy a second GigE network only for MPI.
In case you haven't done this yet, of course.
The cost is very modest, and the performance should improve.
OpenMPI and MPICH can be told which network to use, leaving the other
one for I/O and control.
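For instance, a minimal sketch of the launcher flags (the interface name
eth1 and the process count are assumptions, not taken from your setup):

```shell
# Hypothetical example: assume the MPI-only network sits on each node's eth1.
# Open MPI: restrict its TCP transport to that interface:
mpirun --mca btl_tcp_if_include eth1 -np 16 ./my_solver
# MPICH (Hydra launcher): select the interface with -iface:
mpiexec -iface eth1 -n 16 ./my_solver
```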
My parallel CFD application involves (in its simplest form):
1) reading input and mesh data from a few files by the master process (I/O)
2) sending the full/respective data to all other processes (MPI
broadcast/scatter)
3) exchanging the values at subdomain boundaries (subdomains created by
METIS) at the end of each iteration (MPI message passing)
4) on convergence, sending/gathering the results from all processes to
the master process
5) writing the results to files by the master process (I/O).
So I think my program is not I/O intensive, and funneling I/O through
the master process is sufficient for me. Right?
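The five steps above can be mocked sequentially in plain Python (no actual
MPI calls; scatter, exchange_boundaries, and gather are stand-ins for
MPI_Scatterv, the per-iteration halo exchange, and MPI_Gatherv, and the
"mesh" is just a 1-D list):

```python
# Sequential mock of the funneled I/O pattern: one "master" partitions the
# domain, each "rank" sees its neighbours' boundary values, and the master
# gathers the global result for writing.  No real MPI here.

def scatter(mesh, nranks):
    """Split the 1-D mesh into nranks contiguous subdomains."""
    n = len(mesh)
    chunk = n // nranks
    return [mesh[r * chunk:(r + 1) * chunk if r < nranks - 1 else n]
            for r in range(nranks)]

def exchange_boundaries(subdomains):
    """Each subdomain obtains its neighbours' edge values (halo exchange)."""
    halos = []
    for r, sub in enumerate(subdomains):
        left = subdomains[r - 1][-1] if r > 0 else None
        right = subdomains[r + 1][0] if r < len(subdomains) - 1 else None
        halos.append((left, right))
    return halos

def gather(subdomains):
    """Reassemble the global result on the master (MPI_Gatherv stand-in)."""
    return [x for sub in subdomains for x in sub]

# Step 1: master "reads" the mesh (here: just builds it).
mesh = list(range(10))
# Step 2: scatter to 3 ranks.
subs = scatter(mesh, 3)
# Step 3: one halo exchange (in a real solver, once per iteration).
halos = exchange_boundaries(subs)
# Steps 4-5: gather on the master, which would then write the files.
result = gather(subs)
```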
This scheme doesn't differ much from most
atmosphere/ocean/climate models we run here.
(They call the programs "models" in this community.)
After all, some of the computations here are also CFD-type,
although with reduced forms of the Navier-Stokes equations
on a rotating planet.
(Other computations are not CFD, e.g. radiative transfer.)
We tend to output data every 4000 time steps (order of magnitude),
and in extreme, rare cases, every 40 time steps or so.
There is a lot of computation on each time step, and the usual
exchange of boundary values across subdomains using MPI
(i.e. we also use domain decomposition).
You may have adaptive meshes though,
whereas most of our models use fixed grids.
For this pattern of work and this ratio of
computation-to-communication-to-I/O,
the models that work best are those that funnel I/O through
the "master" processor.
My guess is that this scheme would work OK for you also,
since you seem to output data only "on convergence" (to a steady state
perhaps?).
I presume this takes many time steps,
and involves a lot of computation, and a significant amount of MPI
communication, right?
But now I have to parallelize a new serial code, which plots the results
while running (online/live display). That means it shows the plots
of three/four variables (in small windows) while running, and we see them
as a video (all the progress from the initial stage to the final stage).
I assume that this time much more I/O is involved. At the end of each
iteration the result needs to be gathered from all processes to the
master process, and possibly needs to be written to files as well (I am
not sure). Do we need to write it to some file(s) for online display
after each iteration/time step?
Do you somehow use the movie results to control the program,
change its parameters, or its course of action?
If you do, then the "real time" feature is really required.
Otherwise, you could process the movie offline after the run ends,
although this would spoil the fun of seeing it live, no doubt about it.
I suppose a separate program shows the movie, right?
Short of a more sophisticated solution using MPI-I/O, perhaps
relying on a parallel file system, you could dump the
subdomain snapshots to the nodes' local disks, then run
a separate program to harvest this data, recompose the frames into
the global domain, and exhibit the movie.
(Using MPI types may help rebuild the frames on the global domain.)
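A toy sketch of the recomposition step, in plain Python (a 2-D frame split
into row blocks; in a real harvester the blocks would come from the
per-node snapshot files, and MPI types or file offsets would do this job):

```python
# Toy recomposition of a global movie frame from per-subdomain snapshots.
# Assumes each rank dumped its block of rows; the harvester stacks the
# blocks back in rank order to rebuild the global 2-D frame.

def recompose(snapshots):
    """snapshots: dict mapping rank -> list of rows for that subdomain."""
    frame = []
    for rank in sorted(snapshots):
        frame.extend(snapshots[rank])
    return frame

# Pretend three ranks each dumped two rows of a 6x4 global frame.
snaps = {
    0: [[0, 0, 0, 0], [1, 1, 1, 1]],
    1: [[2, 2, 2, 2], [3, 3, 3, 3]],
    2: [[4, 4, 4, 4], [5, 5, 5, 5]],
}
frame = recompose(snaps)
```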
If the snapshots are not dumped very often, funneling them through the
master processor using MPI_Gather[v], and letting the master processor
output the result, would be OK also.
Regardless of how you do it, I doubt you need one snapshot for each
time step. It should be much less, as you say below.
I think (since the serial code displays the result after each
iteration/time step) I should display the result online only every 100
iterations/time steps in my parallel version, so that less "I/O" and/or
"funneling I/O through the master process" is required.
Any opinion/suggestion?
Yes, you may want to decimate the number of snapshots that you dump to
file, to avoid I/O at every time step.
How many time steps between snapshots?
It depends on how fast the algorithm moves the solution, I would guess.
It should be an interval short enough to provide smooth transitions from
frame to frame, but long enough to avoid too much I/O.
You may want to leave this number as a (Fortran namelist) parameter
that you can choose at runtime.
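The decimation itself is one modulo test; a minimal Python sketch (the
parameter name nsnap is made up, standing in for whatever the namelist
variable would be called):

```python
# Decide at which time steps to dump a snapshot.  The interval would be
# read at runtime (e.g. from a Fortran namelist); nsnap is a made-up name.

def snapshot_steps(nsteps, nsnap):
    """Return the time steps at which a frame is written."""
    return [step for step in range(1, nsteps + 1) if step % nsnap == 0]

# With 1000 time steps and a frame every 100 steps, only 10 dumps occur
# instead of 1000 -- a 100x reduction in I/O traffic.
steps = snapshot_steps(1000, 100)
```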
Movies are 24 frames per second (at least before the digital era).
Jean-Luc Godard once said:
"Photography is truth. Cinema is truth twenty-four times per second."
Of course, he also said:
"Cinema is the most beautiful fraud in the world."
But you don't need to tell your science buddies or your adviser about
that ... :)
Good luck!
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
regards,
Amjad Ali.
On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
Hi Bogdan, list
Oh, well, this is definitely a peer-reviewed list.
My answers were given in the context of Amjad's original
questions, and the perception, based on Amjad's previous
and current postings, that he is not dealing with a large cluster,
or with many users, and plans to both parallelize and update his
code from F77 to F90, which can be quite an undertaking.
Hence, he may want to follow the path of least resistance,
rather than aim at the fanciest programming paradigm.
In the edited answer that context was stripped off,
and so was the description of
"brute force" I/O in parallel jobs.
That was the form of concurrent I/O I was referring to:
an I/O mode that takes no precautions to avoid file and network
contention (all processes try to do I/O at the same time, using the
standard open/read/write/close commands provided by Fortran or another
language, not MPI calls), and which unfortunately is more common than
clean, well-designed parallel I/O code (at least in the field I work in).
Bogdan seems to be talking about programs with well designed
parallel I/O instead.
Bogdan Costescu wrote:
On Wed, 24 Jun 2009, Gus Correa wrote:
the "master" processor reads... broadcasts parameters that
are used by all "slave" processors, and scatters any data
that will be processed in a distributed fashion by each
"slave" processor.
...
That always works, there is no file system contention.
I beg to disagree. There is no file system contention if this
job is the only one doing the I/O at that time, which could be
the case if a job takes the whole cluster. However, in a more
conventional setup with several jobs running at the same time,
there is I/O done from several nodes (running the MPI rank 0 of
each job) at the same time, which will still look like mostly
random I/O to the storage.
Indeed, if there are 1000 jobs running,
even if each one is funneling I/O through
the "master" processor, there will be a large number of competing
requests to the I/O system, hence contention.
However, contention would also happen if all jobs were serial.
Hence, this is not a problem caused by, or specific to, parallel jobs.
It is an intrinsic limitation of the I/O system.
Nevertheless, what if these 1000 jobs are running on the same cluster,
but doing "brute force" I/O through
each of their, say, 100 processes?
Wouldn't file and network contention be larger than if the jobs were
funneling I/O through a single processor?
That is the context in which I made my statement.
Funneling I/O through a "master" processor reduces the chances of file
contention because it minimizes the number of processes doing I/O,
or not?
Another drawback is that you need to write more code for the
I/O procedure.
I also disagree here. The code doing I/O only happens on MPI rank 0,
so there is no need to worry, for the other ranks, about race
conditions, computing a rank-based position in the file, etc.
From what you wrote,
you seem to agree with me on this point, not disagree.
1) Brute-force I/O through all ranks takes little programming effort,
the code is basically the same as the serial code,
and it tends to trigger file contention (and often breaks NFS, etc.).
2) Funneling I/O through the master node takes a moderate programming
effort. One needs to gather/scatter data through the "master"
processor, which concentrates the I/O, and reduces contention.
3) Correct and cautious parallel I/O across all ranks takes a larger
programming effort,
due to the considerations you pointed out above.
In addition, MPI is in control of everything, you are less
dependent on NFS quirks.
... or cluster design. I have seen several clusters which were
designed with 2 networks, a HPC one (Myrinet or Infiniband) and
GigE, where the HPC network had full bisection bandwidth, but
the GigE was a heavily over-subscribed one as the design really
thought only about MPI performance and not about I/O
performance. In such an environment, it's rather useless to try
to do I/O simultaneously from several nodes which share the same
uplink, independent of whether the storage is a single NFS server
or a parallel FS. Doing I/O from only one node would allow full
utilization of the bandwidth on the chain of uplinks to the
file-server and the data could then be scattered/gathered fast
through the HPC network. Sure, a more hardware-aware application
could have been more efficient (f.e. if it would be possible to
describe the network over-subscription so that as many uplinks
could be used simultaneously as possible), but a more balanced
cluster design would have been even better...
Absolutely. But in the small clusters I've had the chance to know about,
designed for scientific computation in a small department or research
group, the emphasis is to pay little attention to I/O.
By the time one gets to the design of the file systems and I/O, the
budget has already been completely used up buying a fast interconnect
for MPI. I/O is then done over Gigabit Ethernet using a single NFS
file server (oftentimes a RAID on the head node itself).
For the scale of a small cluster, with a few tens of nodes or so,
this may work OK,
as long as one writes code that is gentle with NFS
(e.g. by funneling I/O through the head node).
Obviously the large clusters at our national labs and computing centers
do take I/O requirements, parallel file systems, etc. into
consideration. However, that is not my reality here, and I would guess
it is not Amjad's situation either.
[ parallel I/O programs ] always cause a problem when the
number of processors is big.
Sorry, but I didn't say parallel I/O programs.
I said brute force I/O by all processors (using standard NFS,
no parallel file system, all processors banging on the file system
with no coordination).
I'd also like to disagree here. Parallel file systems teach us
that a scalable system is one where the operations are split
between several units that do the work. Applying the same
knowledge to the generation of the data, a scalable application
is one for which the I/O operations are done as much as possible
split between the ranks.
Yes.
If you have a parallel file system.
IMHO, the "problem" that you see is actually caused by reaching
the limits of your cluster, IOW this is a local problem of that
particular cluster and not a problem in the application. By
re-writing the application to make it more NFS-friendly (f.e.
like the above "rank 0 does all I/O"), you will most likely kill
scalability for another HPC setup with a distributed/parallel
storage setup.
Yes, that is true, but may only be critical if the program is I/O
intensive (ours are not).
One may still fare well with funneling I/O through one or a few
processors, if the program is not I/O intensive,
and not compromise scalability.
The opposite, however, i.e.,
writing the program expecting the cluster to
provide a parallel file system,
is unlikely to scale well on a cluster
without one, or not?
Oftentimes these codes were developed on big-iron machines,
ignoring the hurdles one has to face on a Beowulf.
Well, the definition of Beowulf is quite fluid. Nowadays it is
sufficiently easy to get a parallel FS running with commodity
hardware that I wouldn't associate it with big iron anymore.
That is true, but very budget dependent.
If you are on a shoestring budget, and your goal is to do parallel
computing, and your applications are not particularly I/O intensive,
what would you prioritize: a fast interconnect for MPI,
or hardware and software for a parallel file system?
In general they don't use MPI parallel I/O either
Being on the teaching side in a recent course+practical work
involving parallel I/O, I've seen computer science and physics
students making quite easily the transition from POSIX I/O done
on a shared file system to MPI-I/O. They get sometimes an index
wrong, but mostly the conversion is painless. After that, my
impression has become that it's mostly laziness and the attitude
'POSIX is everywhere anyway, why should I bother with
something that might be missing' that keeps applications at this
stage.
I agree with your considerations about laziness and the POSIX-inertia.
However, there is still a long way to make programs and programmers
at least consider the restrictions imposed by network and file systems,
not to mention to use proper parallel I/O.
Hopefully courses like yours will improve this.
If I could, I would love to go to Heidelberg and take your class myself!
Regards,
Gus Correa
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf