We have a parallel application whose load balance shifts while it executes, even though we are certain that it shouldn't. The following describes our experience level, our clusters, our application, and the problem.

Our Experience

We are the developers of an MPI parallel application -- a 2-d time-dependent multiphysics code -- with all the intimate knowledge of its architecture and implementation that implies. We are presently using the Portland Group Fortran and C compilers and MPICH-1 version 1.2.7. We have had success building and using other parallel applications on HPC systems and clusters of workstations, though in those cases the physics was 3-d. We have plenty of Linux workstation sysadmin experience.

Our House-Built Clusters

We have built a few small, generally heterogeneous clusters of workstations around AMD processors, Netgear GA311 NICs, and various switches. We used Red Hat 8 and 9 for our 32-bit processors and have shifted to Fedora for our recent systems, including our few ventures into 64-bit land. Some of our nodes have dual processors. We have not tuned the OSs at all, other than to make sure our NICs have appropriate drivers. Some of our switches give us 80-90% of gigabit speed as measured by NetPIPE, both TCP/IP and MPI, and others give us 30%. In the case described here, the switch is one of the slower ones, but the application's performance is determined by latency, since the messages are relatively small. Our only performance tools are the Linux utility top and a stopwatch.

Our Application Architecture and Performance Expectation

During execution, the application takes thousands of steps that each advance simulation time. The processors advance through the different physics packages, and the parts thereof, in lock step from one MPI_Waitall to the next, with limited amounts of work being done between those synchronization points. We use MPI_Allreduce to compute maxima, minima, and sums of various quantities.
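Schematically, each piece of a time step looks something like the sketch below. The routine and variable names (exchange_and_reduce, MAX_NBR, and so on) are made up for this note; this shows the shape of the pattern, not our actual code.

    /* Illustrative sketch only; names are invented, not taken from our code. */
    #include <mpi.h>

    #define MAX_NBR 8                    /* illustrative limit on neighbors */

    void exchange_and_reduce(double *sendbuf[], double *recvbuf[],
                             int count[], int nbr[], int nnbr,
                             double *local_dt, double *global_dt)
    {
        MPI_Request req[2 * MAX_NBR];
        MPI_Status  stat[2 * MAX_NBR];
        int i, n = 0;

        /* post receives and sends of boundary data for each neighbor */
        for (i = 0; i < nnbr; i++) {
            MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, nbr[i], 99,
                      MPI_COMM_WORLD, &req[n++]);
            MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, nbr[i], 99,
                      MPI_COMM_WORLD, &req[n++]);
        }

        /* every processor stalls here until its own exchanges complete */
        MPI_Waitall(n, req, stat);

        /* e.g., global minimum of the locally allowed time step */
        MPI_Allreduce(local_dt, global_dt, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);
    }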

The application uses a domain decomposition that does not change during a run. Each time step is roughly the same amount of work as the previous ones, though the number of iterations in the implicit solution methods changes. However, all processors take the same number of iterations in any given time step. Thus we expect the relative load on a processor to remain roughly proportional to the relative size of the domain it is assigned in the decomposition. The problem is that it doesn't.

There is one exception to our expectation: intermittently, after some number of time steps or some interval of simulation time, the application does output. Each processor writes dump files identified with its node number to a problem directory, and a single processor then combines those files into one while all the other processors wait. By controlling the frequency of the output, we keep the total time lost in this wait relatively small. In addition, every ten cycles the output processor writes a brief summary of the problem state to the terminal.
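In outline, that dump-and-combine step is something like the following; the file names, directory, and combine routine here are placeholders, not our actual implementation.

    /* Placeholder sketch of the dump/combine step; names are invented. */
    #include <mpi.h>
    #include <stdio.h>

    void write_dump(int rank, int cycle)
    {
        char fname[64];

        /* each processor writes its own piece, tagged with its node number */
        sprintf(fname, "problem_dir/dump_c%06d.n%03d", cycle, rank);
        /* ... write the local domain data to fname ... */

        MPI_Barrier(MPI_COMM_WORLD);     /* wait until all pieces are on disk */

        if (rank == 0) {
            /* the output processor merges the per-node files into one
               combined dump; everyone else idles at the next barrier */
            /* combine_dumps(cycle); */
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }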

One more thing before we get to the problem: we don't use mpirun. Our application reads a processor group file and starts the remote processes itself. Thus there is one process that is distinguished from the others: it was invoked directly from the command line of a shell -- usually tcsh, but never mind that religious war.
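For concreteness, and assuming the familiar MPICH ch_p4 procgroup layout, the startup looks roughly like this (the host names and path are made up for illustration):

    local 0
    node2 1 /home/mike/bin/ourcode
    node3 1 /home/mike/bin/ourcode
    node4 1 /home/mike/bin/ourcode

The run is started by hand from a shell on the first node with something like "./ourcode -p4pg pgfile" plus our normal input arguments, so the process on that node is the directly invoked, "originating" one.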

The Problem

We have observed unexpected and extreme load-balance shifts during both two- and four-processor runs. In what follows, our focus will be on the four-processor run. We observe the load balance by monitoring CPU usage on each of the processors with separate xterm-invoked instances of top from a non-cluster machine. Our primary observable is %CPU; as a secondary observable, we monitor the wall-time interval between the 10-cycle terminal edits.

The load balance starts out looking like the relative sizes of the domains we assigned to the various processors, just as we expect. The processor on which the run was started has the smallest domain to handle, and its %CPU is initially around 50%, while the others are around 90%. After a few hundred time steps or so, the CPU usage of the processor on which the job was started begins to climb and that of the others begins to fall. After a thousand time steps or so, the CPU usage is nearly 90% for the originating process and less than 20% for the remote processes. Not surprisingly, the wall time between 10-cycle terminal edits goes up by a factor of 4 over the same period. By observation, no other task ever consumes more than a few tenths of a percent of the CPU.

The originating processor is the output processor, but only the terminal output happens during this period, and we observe no significant change in CPU usage during the cycles when that output is produced. Top updates its display every 5 seconds, and in this run our application takes one time step every 2 seconds. The count and size of the messages imply that two of the processors spend about 30% of their time in system time on message startup, and about a tenth of that actually transmitting data. There are about 6,000 messages sent and received in each time step on those processors, though the number varies slightly from time step to time step. The other two processors -- one of which is the originating processor -- have about half that many messages to send and receive, and spend correspondingly less time doing it.

Though we have shuffled the originating processor and the processors in the group, the results are always similar. In one case we ran with four nodes that were identical except that one had Red Hat 8 while the others had Red Hat 9. In another case we ran four Red Hat 9 machines with slightly different AMD processor speeds (2.08 vs. 2.16 GHz). The 9.0 kernels are 2.4.20, while the 8.0 kernel has been upgraded to 2.4.18.

Here is a final bit of data. To demonstrate that the shift was not determined by the state of the problem being simulated, we restarted the simulation from a restart dump made by our application after the load had shifted to the originating processor. The load balance immediately after the restart again reflected the domain sizes, just as it had at the beginning of the original run. After a thousand cycles of the restarted problem, the load had once again shifted to the originating processor.

Conclusion/Hypothesis

Our tentative conclusion is that either MPICH or the operating system is consuming an increasing amount of CPU time on the originating processor as the number of time steps accumulates. The accumulated number of messages transmitted is the likely culprit. It acts like a leak, but of CPU time rather than memory: top does not show any increase in resident set size (RSS) during the run.

Does anyone have ideas about what this behavior might be, how we can test for it, and what we can do to fix it? Thanks in advance for any help.


Mike