[Beowulf] Load Balance Shifts During Run of Fixed Balance Application [RESOLVED]

Michael H. Frese Mon, 05 Mar 2007 11:57:13 -0800

Thanks to those who took the time to consider my original description of our problem. It has now been resolved and the simulation load balance is remaining fixed over thousands of time steps.

The problem, not surprisingly, was in our application code, specifically in our use of MPI in one particular place. We had posted some receives on the originating processor -- which was also the output processor -- for messages that were never sent. We failed to detect the error because -- in another error -- we had failed to do a WaitAll on the receive message queue for those messages. The result was that the originating/output processor had an ever-increasing receive queue to hunt through while pairing up receives and arriving messages, and so took longer with each successive timestep.
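To illustrate the mechanism, here is a toy Python model of a posted-receive queue. It is only a sketch and not MPI code: the function name simulate_posted_queue, the tuple tags, and the linear front-to-back search are assumptions made for illustration, and real MPI implementations may organize their queues differently. Each timestep posts some receives that will be matched and some "stale" ones that never will be; the stale ones are never removed, so every later match has to step over them.

```python
def simulate_posted_queue(steps, real_per_step, stale_per_step):
    """Return the number of queue entries scanned in each timestep.

    Toy model (assumption): posted receives live in one list that is
    searched from the front for every arriving message.
    """
    posted = []            # pending receives, oldest first
    scans_per_step = []
    for step in range(steps):
        # Receives for messages that are never sent; nothing removes them.
        posted.extend(("stale", step, i) for i in range(stale_per_step))
        # Receives for messages that really do arrive this timestep.
        posted.extend(("real", step, i) for i in range(real_per_step))

        scanned = 0
        for i in range(real_per_step):
            arriving = ("real", step, i)
            for pos, entry in enumerate(posted):
                scanned += 1
                if entry == arriving:
                    del posted[pos]   # matched receive leaves the queue
                    break
        scans_per_step.append(scanned)
    return scans_per_step


if __name__ == "__main__":
    print("with stale receives:   ", simulate_posted_queue(5, 3, 2))
    print("without stale receives:", simulate_posted_queue(5, 3, 0))
```

In this model the per-timestep search cost grows linearly with the timestep number whenever stale_per_step is nonzero, and stays flat when it is zero, which matches the qualitative behavior described above.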

We also sent some messages to processors that did not exist, though I think this was less of a problem.

We found the problem by looking for one of a related kind. We built and ran a test code, and found accidentally that failing to post receives caused processors to have to hunt through an increasing queue of received but unprocessed messages.
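The related effect seen in the test code can be sketched the same way. The following toy Python model is again an illustration only, not MPI code and not our test code: the name simulate_unexpected_queue and the single linearly searched list are assumptions. Messages arrive every timestep, but receives are posted for only some of them, so the unclaimed ones pile up at the front of the queue.

```python
def simulate_unexpected_queue(steps, sent_per_step, received_per_step):
    """Return the number of queued messages scanned in each timestep.

    Toy model (assumption): arrived-but-unmatched messages sit in one
    list that is searched from the front for every posted receive.
    Requires received_per_step <= sent_per_step.
    """
    unexpected = []        # arrived messages with no matching receive yet
    scans_per_step = []
    for step in range(steps):
        # All of this timestep's messages arrive before any receive.
        unexpected.extend((step, i) for i in range(sent_per_step))

        scanned = 0
        # Receives are posted only for the last `received_per_step`
        # messages; the earlier ones are never claimed.
        for i in range(sent_per_step - received_per_step, sent_per_step):
            wanted = (step, i)
            for pos, msg in enumerate(unexpected):
                scanned += 1
                if msg == wanted:
                    del unexpected[pos]   # message consumed by the receive
                    break
        scans_per_step.append(scanned)
    return scans_per_step


if __name__ == "__main__":
    print("one message per step never received:", simulate_unexpected_queue(5, 3, 2))
    print("every message received:             ", simulate_unexpected_queue(5, 3, 3))
```

As with the posted-receive model, the search cost per timestep climbs steadily when any messages go unreceived and is constant when every message is claimed.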


Thanks again.


Mike

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
