Thanks to those who took the time to consider my original description of our problem. It has now been resolved, and the simulation load balance stays fixed over thousands of time steps.

The problem, not surprisingly, was in our application code, specifically in our use of MPI in one particular place. We had posted some receives on the originating processor -- which was also the output processor -- for messages that were never sent. We failed to detect the error because -- in another error -- we had failed to do an MPI_Waitall on the receive requests for those messages. The result was that the originating/output processor had an ever-increasing receive queue to hunt through while pairing up receives and arriving messages, and so took longer with each successive time step.
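
To illustrate the mechanism, here is a toy model in Python. It is not MPI. It assumes the library keeps posted receives in a plain list and scans it front to back for every arriving message; real MPI implementations may organize matching differently. The names simulate, orphans_per_step and real_msgs_per_step are made up for this sketch.

```python
def simulate(timesteps, orphans_per_step, real_msgs_per_step=4):
    """Return the number of queue entries scanned in each time step.

    Toy model (an assumption, not any MPI library's internals):
    posted receives live in a list that is searched linearly.
    Orphaned receives are posted for messages that are never sent,
    so they are never completed and pile up at the front of the list.
    """
    posted = []            # tags of outstanding posted receives
    scans_per_step = []
    next_orphan_tag = -1   # negative tags: no send ever uses them

    for _ in range(timesteps):
        # Post receives that no sender will ever satisfy.
        for _ in range(orphans_per_step):
            posted.append(next_orphan_tag)
            next_orphan_tag -= 1
        # Post the receives that really are matched this step.
        for tag in range(real_msgs_per_step):
            posted.append(tag)

        scanned = 0
        # Each arriving message walks the list until it finds its tag.
        for tag in range(real_msgs_per_step):
            idx = posted.index(tag)
            scanned += idx + 1
            del posted[idx]
        scans_per_step.append(scanned)

    return scans_per_step
```

With orphans_per_step=0 the work per step stays flat: simulate(3, 0) returns [4, 4, 4]. With a single orphaned receive per step it grows every step: simulate(3, 1) returns [8, 12, 16]. In this model the growth is linear in the number of steps, which is the kind of steady slowdown described above.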

We also sent some messages to processors that did not exist, though I think this was less of a problem.

We found the problem by looking for one of a related kind. We built and ran a test code, and found accidentally that failing to post receives caused processors to have to hunt through an increasing queue of received but unprocessed messages.
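
All three mistakes mentioned here (receives with no matching send, sends to ranks that do not exist, and sends with no matching receive) are mismatches between the sends and the receives an application intends to make. As a hedged sketch, a small offline check in Python could flag them if the communication pattern can be written down as lists. This is not the test code we ran; check_plan and its tuple layout are hypothetical, and the sketch ignores wildcards such as MPI_ANY_SOURCE and MPI_ANY_TAG.

```python
from collections import Counter


def check_plan(nranks, sends, recvs):
    """Compare intended sends and receives and report mismatches.

    sends: list of (src, dst, tag) tuples
    recvs: list of (dst, src, tag) tuples, i.e. (receiver, sender, tag)
    Hypothetical helper for illustration; exact matching only.
    """
    # Sends whose destination is not a valid rank.
    bad_rank = [s for s in sends if not (0 <= s[1] < nranks)]

    # Count messages by (src, dst, tag) so duplicates are handled.
    send_count = Counter(
        (src, dst, tag) for src, dst, tag in sends if 0 <= dst < nranks
    )
    recv_count = Counter((src, dst, tag) for dst, src, tag in recvs)

    # Counter subtraction keeps only the positive differences.
    orphan_recvs = sorted((recv_count - send_count).elements())
    unreceived_sends = sorted((send_count - recv_count).elements())

    return {
        "bad_rank": bad_rank,
        "orphan_recvs": orphan_recvs,
        "unreceived_sends": unreceived_sends,
    }
```

For example, with nranks=4, sends [(1, 0, 7), (2, 0, 7), (1, 5, 3)] and recvs [(0, 1, 7), (0, 3, 7)], the send to rank 5 is reported under bad_rank, the receive from rank 3 under orphan_recvs, and the send from rank 2 under unreceived_sends. Entries in the last two lists are written as (src, dst, tag).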

Thanks again.


Mike

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
