Hi Amjad, list

amjad ali wrote:
Hi,
Gus--thank you.
You are right. I mainly have to run programs on a small cluster (GigE) dedicated to my job only, and sometimes I may get the opportunity to run my code on a shared cluster with hundreds of nodes.


Thanks for telling me.
My guess was not far from the truth. :)

BTW, if you are doing cluster control, I/O, and MPI all across the same
GigE network, and if your nodes have dual GigE ports, you may consider
buying a second switch (or using a VLAN on your existing switch,
if it supports them) to deploy a second GigE network for MPI only,
in case you haven't done this already.
The cost is very modest, and the performance should improve.
Open MPI and MPICH can be told which network to use (for example,
via Open MPI's btl_tcp_if_include MCA parameter), leaving the other
network for I/O and control.

My parallel CFD application involves (in its simplest form):
1) reading the input and mesh data from a few files by the master process (I/O)
2) sending the full/respective data to all other processes (MPI broadcast/scatter)
3) sharing the values at subdomain boundaries (subdomains created by METIS) at the end of each iteration (MPI message passing)
4) on convergence, sending/gathering the results from all processes to the master process
5) writing the results to files by the master process (I/O).

So I think my program is not I/O intensive, and funneling I/O through the master process should be sufficient for me. Right?


This scheme doesn't differ much from most of the
atmosphere/ocean/climate models we run here.
(They call the programs "models" in this community.)
After all, some of the computations here are also CFD-type,
although with reduced forms of the Navier-Stokes equations
on a rotating planet.
(Other computations are not CFD, e.g. radiative transfer.)

We tend to output data every 4000 time steps (order of magnitude),
and in extreme, rare, cases every 40 time steps or so.
There is a lot of computation on each time step, and the usual
exchange of boundary values across subdomains using MPI
(i.e. we also use domain decomposition).
You may have adaptive meshes though,
whereas most of our models use fixed grids.

For this pattern of work and this ratio of computation-to-communication-to-I/O,
the models that work best are those that funnel I/O through
the "master" processor.

My guess is that this scheme would work OK for you also,
since you seem to output data only "on convergence" (to a steady state perhaps?).
I presume this takes many time steps,
and involves a lot of computation, and a significant amount of MPI communication, right?
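
In case it helps, here is a minimal sketch of that funneling scheme
in Fortran + MPI (the names are mine and hypothetical; it assumes a
1-D array split evenly across the ranks, whereas your mesh data will
of course be more involved):

program funnel_io
   use mpi
   implicit none
   integer, parameter :: nglobal = 1000000
   integer :: ierr, rank, nprocs, nlocal, i
   real(8), allocatable :: ulocal(:), uglobal(:)
   integer, allocatable :: counts(:), displs(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   nlocal = nglobal / nprocs          ! assume nprocs divides nglobal
   allocate(ulocal(nlocal), counts(nprocs), displs(nprocs))
   counts = nlocal
   displs = [( (i-1)*nlocal, i = 1, nprocs )]
   if (rank == 0) then
      allocate(uglobal(nglobal))
   else
      allocate(uglobal(1))            ! dummy; only significant on rank 0
   end if

   ! ... rank 0 would read the input files and MPI_Bcast the parameters,
   !     then every rank time-steps its own subdomain; here we just
   !     fill dummy data ...
   ulocal = real(rank, 8)

   ! On convergence: gather all subdomain results onto rank 0 ...
   call MPI_Gatherv(ulocal, nlocal, MPI_DOUBLE_PRECISION,           &
                    uglobal, counts, displs, MPI_DOUBLE_PRECISION,  &
                    0, MPI_COMM_WORLD, ierr)

   ! ... and only rank 0 ever touches the file system.
   if (rank == 0) then
      open(unit=20, file='results.dat', form='unformatted')
      write(20) uglobal
      close(20)
   end if

   call MPI_Finalize(ierr)
end program funnel_io

The point is simply that rank 0 alone opens files; everybody else
only talks MPI.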

But now I have to parallelize a new serial code, which plots the results while running (online/live display). That is, it shows plots of three or four variables (in small windows) while running, and we watch it as a video (the whole progress from the initial stage to the final stage). I assume that this time much more I/O is involved. At the end of each iteration the results need to be gathered from all processes onto the master process, and possibly written to files as well (I am not sure). Do we need to write the data to one or more files for the online display after each iteration/time step?


Do you somehow use the movie results to control the program,
change its parameters, or its course of action?
If you do, then the "real time" feature is really required.
Otherwise, you could process the movie offline after the run ends,
although this would spoil the fun of seeing it live, no doubt about it.

I suppose a separate program shows the movie, right?
Short of a more sophisticated solution using MPI-I/O and perhaps
relying on a parallel file system, you could dump the
subdomain snapshots to the nodes' local disks, then run
a separate program to harvest this data, recompose the frames into
the global domain, and exhibit the movie.
(Using MPI types may help rebuild the frames on the global domain.)
If the snapshots are not dumped very often, funneling them through the
master processor using MPI_Gather[v], and letting the master processor output the result, would be OK also.
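
For what it is worth, here is one hedged sketch of the "MPI types"
idea (again the names are mine and hypothetical; it assumes a regular
nx x ny grid split into px x py equal rectangles, and my own guess at
the rank-to-subdomain mapping):

! Rebuild a global 2-D frame on rank 0 from the subdomain blocks.
subroutine gather_frame(ulocal, uframe, nx, ny, px, py, rank, nprocs)
   use mpi
   implicit none
   integer, intent(in)    :: nx, ny, px, py, rank, nprocs
   real(8), intent(in)    :: ulocal(nx/px, ny/py)
   real(8), intent(inout) :: uframe(nx, ny)   ! only used on rank 0
   integer :: ierr, src, blocktype, lx, ly, ix, iy
   integer :: sizes(2), subsizes(2), starts(2)

   lx = nx / px
   ly = ny / py

   if (rank /= 0) then
      ! Workers ship their block to the master.
      call MPI_Send(ulocal, lx*ly, MPI_DOUBLE_PRECISION, 0, 99,      &
                    MPI_COMM_WORLD, ierr)
   else
      ! The master copies its own block, then receives each worker's
      ! block straight into the right rectangle of the global frame.
      uframe(1:lx, 1:ly) = ulocal
      do src = 1, nprocs - 1
         ix = mod(src, px)            ! subdomain column of rank src
         iy = src / px                ! subdomain row of rank src
         sizes    = [nx, ny]
         subsizes = [lx, ly]
         starts   = [ix*lx, iy*ly]    ! subarray starts are 0-based
         call MPI_Type_create_subarray(2, sizes, subsizes, starts,   &
              MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, blocktype, ierr)
         call MPI_Type_commit(blocktype, ierr)
         call MPI_Recv(uframe, 1, blocktype, src, 99,                &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call MPI_Type_free(blocktype, ierr)
      end do
      ! ... rank 0 now writes or streams uframe as one movie frame ...
   end if
end subroutine gather_frame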

Regardless of how you do it, I doubt you need one snapshot for each
time step. Snapshots can be far less frequent, as you say below.

I think that, since the serial code displays results after each iteration/time step, in my parallel version I should display results online only every 100 iterations/time steps, so that less "I/O" and/or "funneling of I/O through the master process" is required.
Any opinion/suggestion?


Yes, you may want to decimate the number of snapshots that you dump to
file, to avoid I/O at every time step.
How many time steps between snapshots?
It depends on how fast the algorithm moves the solution, I would guess.
It should be an interval short enough to provide smooth transitions from frame to frame, but long enough to avoid too much I/O.
You may want to leave this number as a (Fortran namelist) parameter
that you can choose at runtime.
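
Something as simple as this would do (hypothetical names; it assumes
a small file run.nml sitting in the run directory):

program snapshot_control
   implicit none
   integer :: nsnap = 100    ! default: one movie frame every 100 steps
   integer :: istep, nsteps
   namelist /output_ctl/ nsnap

   ! run.nml would contain, e.g.:  &output_ctl  nsnap = 40  /
   open(unit=15, file='run.nml', status='old', action='read')
   read(15, nml=output_ctl)
   close(15)

   nsteps = 4000
   do istep = 1, nsteps
      ! ... advance the solution one time step ...
      if (mod(istep, nsnap) == 0) then
         ! ... gather the frame and write it (rank 0 only, in the
         !     parallel version) ...
         print *, 'snapshot at step', istep
      end if
   end do
end program snapshot_control

This way you can tune the frame interval per run without recompiling.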

Movies run at 24 frames per second (at least they did before the digital era).

Jean-Luc Godard once said:
"Photography is truth. Cinema is truth twenty-four times per second."

Of course he also said:
"Cinema is the most beautiful fraud in the world.".

But you don't need to tell your science buddies or your adviser about that ... :)

Good luck!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


regards,
Amjad Ali.



On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

    Hi Bogdan, list

    Oh, well, this is definitely a peer reviewed list.
    My answers were given in the context of Amjad's original
    questions, and the perception, based on Amjad's previous
    and current postings, that he is not dealing with a large cluster,
    or with many users, and plans to both parallelize and update his
    code from F77 to F90, which can be quite an undertaking.
    Hence, he may want to follow the path of least resistance,
    rather than aim at the fanciest programming paradigm.

    In the edited answer that context was stripped off,
    and so was the description of
    "brute force" I/O in parallel jobs.
    That was the form of concurrent I/O I was referring to:
    an I/O mode that takes no precautions to avoid file and network
    contention, where all processors try to do I/O at the same time
    using the standard open/read/write/close commands provided by
    Fortran or another language, not MPI calls.
    Unfortunately it is more common than clean, well designed
    parallel I/O code (at least in the field I work in).

    Bogdan seems to be talking about programs with well designed
    parallel I/O instead.


    Bogdan Costescu wrote:

        On Wed, 24 Jun 2009, Gus Correa wrote:

            the "master" processor reads... broadcasts parameters that
            are used by all "slave" processors, and scatters any data
            that will be processed in a distributed fashion by each
            "slave" processor.
            ...
            That always works, there is no file system contention.


        I beg to disagree. There is no file system contention if this
        job is the only one doing the I/O at that time, which could be
        the case if a job takes the whole cluster. However, in a more
        conventional setup with several jobs running at the same time,
        there is I/O done from several nodes (running the MPI rank 0 of
        each job) at the same time, which will still look like mostly
        random I/O to the storage.


    Indeed, if there are 1000 jobs running,
    even if each one is funneling I/O through
    the "master" processor, there will be a large number of competing
    requests to the I/O system, hence contention.
    However, contention would also happen if all jobs were serial.
    Hence, this is not a problem caused by or specific to parallel jobs.
    It is an intrinsic limitation of the I/O system.

    Nevertheless, what if these 1000 jobs are running on the same cluster,
    but doing "brute force" I/O through
    each of their, say, 100 processes?
    Wouldn't file and network contention be larger than if the jobs were
    funneling I/O through a single processor?

    That is the context in which I made my statement.
    Funneling I/O through a "master" processor reduces the chances of file
    contention because it minimizes the number of processes doing I/O,
    or not?


            Another drawback is that you need to write more code for the
            I/O procedure.


        I also disagree here. The code doing I/O would only need to
        run on MPI rank 0, so there is no need to worry, for the other
        ranks, about race conditions, computing a rank-based position
        in the file, etc.


    From what you wrote,
    you seem to agree with me on this point, not to disagree.

    1) Brute force I/O through all ranks takes little programming effort,
    the code is basically the same as the serial one,
    and it tends to trigger file contention (and often breaks NFS, etc.).
    2) Funneling I/O through the master node takes a moderate programming
    effort.  One needs to gather/scatter data through the "master"
    processor, which concentrates the I/O and reduces contention.
    3) Correct and cautious parallel I/O across all ranks takes a larger
    programming effort, due to the considerations you pointed out above
    (a sketch follows below).
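
    Just to make the comparison concrete, here is a minimal, hedged
    sketch of option 3 with MPI-I/O in Fortran (the names are mine and
    hypothetical; it assumes an even 1-D decomposition and 8-byte reals):

    ! Every rank writes its own slice of one shared file (sketch).
    subroutine write_parallel(ulocal, nlocal, rank)
       use mpi
       implicit none
       integer, intent(in) :: nlocal, rank
       real(8), intent(in) :: ulocal(nlocal)
       integer :: ierr, fh
       integer(kind=MPI_OFFSET_KIND) :: offset

       call MPI_File_open(MPI_COMM_WORLD, 'results.dat',              &
            MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

       ! Each rank computes its byte offset from its rank and block size.
       offset = int(rank, MPI_OFFSET_KIND) * nlocal * 8_MPI_OFFSET_KIND

       ! Collective write: the MPI library coordinates the ranks, which
       ! is what lets a parallel file system (if any) do its job.
       call MPI_File_write_at_all(fh, offset, ulocal, nlocal,         &
            MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)

       call MPI_File_close(fh, ierr)
    end subroutine write_parallel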


            In addition, MPI is in control of everything, you are less
            dependent on NFS quirks.


        ... or cluster design. I have seen several clusters which were
        designed with 2 networks, a HPC one (Myrinet or Infiniband) and
        GigE, where the HPC network had full bisection bandwidth, but
        the GigE was a heavily over-subscribed one as the design really
        thought only about MPI performance and not about I/O
        performance. In such an environment, it's rather useless to try
        to do I/O simultaneously from several nodes which share the same
        uplink, regardless of whether the storage is a single NFS server
        or a parallel FS. Doing I/O from only one node would allow full
        utilization of the bandwidth on the chain of uplinks to the
        file-server and the data could then be scattered/gathered fast
        through the HPC network. Sure, a more hardware-aware application
        could have been more efficient (e.g. if it were possible to
        describe the network over-subscription so that as many uplinks
        could be used simultaneously as possible), but a more balanced
        cluster design would have been even better...


    Absolutely, but on the small clusters I have had the chance to know
    about, designed for scientific computing in a small department or
    research group, the emphasis I have seen is to pay less attention
    to I/O.
    By the time one gets to the design of the file systems and I/O,
    the budget has already been completely used up buying a fast
    interconnect for MPI.
    I/O is then done over Gigabit Ethernet using a single NFS
    file server (oftentimes a RAID on the head node itself).
    At the scale of a small cluster, with a few tens of nodes or so,
    this may work OK,
    as long as one writes code that is gentle with NFS
    (e.g. by funneling I/O through the head node).

    Obviously the large clusters at our national labs and computer centers
    do take into consideration I/O requirements, parallel file systems,
    etc. However, that is not my reality here, and I would guess it is
    not Amjad's situation either.


            [ parallel I/O programs ] always cause a problem when the
            number of processors is big.



    Sorry, but I didn't say parallel I/O programs.
    I said brute force I/O by all processors (using standard NFS,
    no parallel file system, all processors banging on the file system
    with no coordination).


        I'd also like to disagree here. Parallel file systems teach us
        that a scalable system is one where the operations are split
        between several units that do the work. Applying the same
        knowledge to the generation of the data, a scalable application
        is one for which the I/O operations are, as much as possible,
        split between the ranks.


    Yes.
    If you have a parallel file system.



        IMHO, the "problem" that you see is actually caused by reaching
        the limits of your cluster, IOW this is a local problem of that
        particular cluster and not a problem in the application. By
        re-writing the application to make it more NFS-friendly (e.g.
        like the above "rank 0 does all I/O"), you will most likely kill
        scalability for another HPC setup with a distributed/parallel
        storage setup.

    Yes, that is true, but it may only be critical if the program is I/O
    intensive (ours are not).
    One may still fare well with funneling I/O through one or a few
    processors, if the program is not I/O intensive,
    and not compromise scalability.

    The opposite, however, i.e.,
    writing the program expecting the cluster to
    provide a parallel file system,
    is unlikely to scale well on a cluster
    without one, or not?


            Often times these codes were developed on big iron machines,
            ignoring the hurdles one has to face on a Beowulf.


        Well, the definition of Beowulf is quite fluid. Nowadays it is
        sufficiently easy to get a parallel FS running with commodity
        hardware that I wouldn't associate it anymore with big iron.


    That is true, but very budget dependent.
    If you are on a shoestring budget, and your goal is to do parallel
    computing, and your applications are not particularly I/O intensive,
    what would you prioritize: a fast interconnect for MPI,
    or hardware and software for a parallel file system?


            In general they don't use MPI parallel I/O either


        Being on the teaching side in a recent course + practical work
        involving parallel I/O, I've seen computer science and physics
        students making the transition from POSIX I/O done on a shared
        file system to MPI-I/O quite easily. They sometimes get an index
        wrong, but mostly the conversion is painless. After that, my
        impression has become that it's mostly laziness and the attitude
        'POSIX is everywhere anyway, why should I bother with
        something that might be missing' that keeps applications at this
        stage.


    I agree with your considerations about laziness and POSIX inertia.
    However, there is still a long way to go before programs and
    programmers at least consider the restrictions imposed by networks
    and file systems, let alone use proper parallel I/O.
    Hopefully courses like yours will improve this.
    If I could, I would love to go to Heidelberg and take your class myself!

    Regards,
    Gus Correa



_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
