We have a parallel application whose load balance shifts while it executes, even though we are certain that it shouldn't. The following describes our experience level, our clusters, our application, and the problem.

Our Experience

We are the developers of an MPI parallel application -- a 2-d time-dependent multiphysics code -- with all the intimate knowledge of its architecture and implementation that implies. We are presently using the Portland Group Fortran and C compilers and MPICH-1 version 1.2.7. We have had success building and using other parallel applications on HPC systems and clusters of workstations, though in those cases the physics was 3-d. We have plenty of Linux workstation sysadmin experience.

Our House-Built Clusters

We have built a few small, generally heterogeneous clusters of workstations around AMD processors, Netgear GA311 NICs, and various switches. We used Red Hat 8 and 9 for our 32-bit processors and have shifted to Fedora for our recent systems, including our few ventures into 64-bit land. Some of our nodes have dual processors. We have not tuned the OSs at all, other than to make sure our NICs have appropriate drivers. Some of our switches give us 80-90% of gigabit speed as measured by NetPIPE, both TCP/IP and MPI, and others give us 30%. In the case described here, the switch is one of the slower ones, but the application's performance is determined by latency, since the messages are relatively small. Our only performance tools are the Linux utility top and a stopwatch.

Our Application Architecture and Performance Expectation

During execution, the application takes thousands of steps that each advance simulation time. The processors advance through the different physics packages, and the parts thereof, in lock step from one MPI_Waitall to the next, with limited amounts of work being done between those synchronization points. We use MPI_Allreduce to compute maxima, minima, and sums of various quantities.
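Schematically, each piece of a time step looks something like the sketch below. The routine and variable names (exchange_and_reduce, MAX_NBR, and so on) are made up for this note; this shows the shape of the pattern, not our actual code.

    /* Illustrative sketch only; names are invented, not taken from our code. */
    #include <mpi.h>

    #define MAX_NBR 8                    /* illustrative limit on neighbors */

    void exchange_and_reduce(double *sendbuf[], double *recvbuf[],
                             int count[], int nbr[], int nnbr,
                             double *local_dt, double *global_dt)
    {
        MPI_Request req[2 * MAX_NBR];
        MPI_Status  stat[2 * MAX_NBR];
        int i, n = 0;

        /* post receives and sends of boundary data for each neighbor */
        for (i = 0; i < nnbr; i++) {
            MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, nbr[i], 99,
                      MPI_COMM_WORLD, &req[n++]);
            MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, nbr[i], 99,
                      MPI_COMM_WORLD, &req[n++]);
        }

        /* every processor stalls here until its own exchanges complete */
        MPI_Waitall(n, req, stat);

        /* e.g., global minimum of the locally allowed time step */
        MPI_Allreduce(local_dt, global_dt, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);
    }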

The application uses a domain decomposition that does not change during a run. Each time step is roughly the same amount of work as the previous ones, though the number of iterations in the implicit solution methods changes. However, all processors take the same number of iterations in any given time step. Thus we expect the relative load on a processor to remain roughly proportional to the relative size of the domain it is assigned in the decomposition. The problem is that it doesn't.

There is one exception to our expectation: intermittently, after some number of time steps or some interval of simulation time, the application does output. Each processor writes dump files identified with its node number to a problem directory, and a single processor then combines those files into one while all the other processors wait. By controlling the frequency of the output, we keep the total time lost in this wait relatively small. In addition, every ten cycles the output processor writes a brief summary of the problem state to the terminal.
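In outline, that dump-and-combine step is something like the following; the file names, directory, and combine routine here are placeholders, not our actual implementation.

    /* Placeholder sketch of the dump/combine step; names are invented. */
    #include <mpi.h>
    #include <stdio.h>

    void write_dump(int rank, int cycle)
    {
        char fname[64];

        /* each processor writes its own piece, tagged with its node number */
        sprintf(fname, "problem_dir/dump_c%06d.n%03d", cycle, rank);
        /* ... write the local domain data to fname ... */

        MPI_Barrier(MPI_COMM_WORLD);     /* wait until all pieces are on disk */

        if (rank == 0) {
            /* the output processor merges the per-node files into one
               combined dump; everyone else idles at the next barrier */
            /* combine_dumps(cycle); */
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }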

One more thing before we get to the problem: we don't use mpirun. Our application reads a processor group file and starts the remote processes itself. Thus there is one process that is distinguished from the others: it was invoked directly from the command line of a shell -- usually tcsh, but never mind that religious war.
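For concreteness, and assuming the familiar MPICH ch_p4 procgroup layout, the startup looks roughly like this (the host names and path are made up for illustration):

    local 0
    node2 1 /home/mike/bin/ourcode
    node3 1 /home/mike/bin/ourcode
    node4 1 /home/mike/bin/ourcode

The run is started by hand from a shell on the first node with something like "./ourcode -p4pg pgfile" plus our normal input arguments, so the process on that node is the directly invoked, "originating" one.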

The Problem

We have observed unexpected and extreme load-balance shifts during both two- and four-processor runs. In what follows, our focus will be on the four-processor run. We observe the load balance by monitoring CPU usage on each of the processors with separate xterm-invoked instances of top from a non-cluster machine. Our primary observable is %CPU; as a secondary observable, we monitor the wall-time interval between the 10-cycle terminal edits.

The load balance starts out looking like the relative sizes of the domains we assigned to the various processors, just as we expect. The processor on which the run was started has the smallest domain to handle, and its %CPU is initially around 50%, while the others are around 90%. After a few hundred time steps or so, the CPU usage of the processor on which the job was started begins to climb and that of the others begins to fall. After a thousand time steps or so, the CPU usage is nearly 90% for the originating process and less than 20% for the remote processes. Not surprisingly, the wall time between 10-cycle terminal edits goes up by a factor of 4 over the same period. By observation, no other task ever consumes more than a few tenths of a percent of the CPU.

The originating processor is the output processor, but only the terminal output happens during this period, and we observe no significant change in CPU usage during the cycles when that output is produced. Top updates its display every 5 seconds, and in this run our application takes one time step every 2 seconds. The count and size of the messages imply that two of the processors spend about 30% of their time in system time on message startup, and about a tenth of that actually transmitting data. There are about 6,000 messages sent and received in each time step on those processors, though the number varies slightly from time step to time step. The other two processors -- one of which is the originating processor -- have about half that many messages to send and receive, and spend correspondingly less time doing it.

Though we have shuffled the originating processor and the processors in the group, the results are always similar. In one case we ran with four nodes that were identical except that one had Red Hat 8 while the others had Red Hat 9. In another case we ran four Red Hat 9 machines with slightly different AMD processor speeds (2.08 vs. 2.16 GHz). The 9.0 kernels are 2.4.20, while the 8.0 kernel has been upgraded to 2.4.18.

Here is a final bit of data. To demonstrate that the shift was not determined by the state of the problem being simulated, we restarted the simulation from a restart dump made by our application after the load had shifted to the originating processor. The load balance immediately after the restart again reflected the domain sizes, just as it had at the beginning of the original run. After a thousand cycles of the restarted problem, the load had once again shifted to the originating processor.

Conclusion/Hypothesis

Our tentative conclusion is that either MPICH or the operating system is consuming an increasing amount of CPU time on the originating processor as the number of time steps accumulates. The accumulated number of messages transmitted is the likely culprit. It acts like a leak, but of CPU time rather than memory: top does not show any increase in resident set size (RSS) during the run.

Does anyone have ideas about what this behavior might be, how we can test for it, and what we can do to fix it? Thanks in advance for any help.


Mike