On 12/18/07, Mark Hahn <[EMAIL PROTECTED]> wrote:
>
> > The machines are running the 2.6 kernel and I have confirmed that the max
> > TCP send/recv buffer sizes are 4MB (more than enough to store the full
> > 512x512 image).
>
> the bandwidth-delay product in a lan is low enough to not need
> this kind of tuning.

I didn't actually do any tuning; I just checked that the maximum buffer
size the Linux auto-tuning can use is sufficient.

> > I loop with the client side program sending a single integer to rank 0,
> > then rank 0 broadcasts this integer to the other nodes, and then all
> > nodes send back 1MB / N of data.
>
> hmm, that's a bit harsh, don't you think?  why not have the rank0/master
> ask each slave for its contribution sequentially?  sure, it introduces a
> bit of "dead air", but it's not as if two slaves can stream to a single
> master at once anyway (each can saturate its link, therefore the master's
> link is N-times overcommitted.)

I guess I figured that the data is relatively small compared to the
bandwidth, whereas the latency for ethernet is relatively high. I also
thought the switch would be able to efficiently buffer and forward the
data. I am not much of a networking guy (more a graphics guy), so I
realize I could be way off base here. To make the pattern concrete, the
per-frame exchange looks roughly like the sketch below.
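This is a simplified sketch rather than my actual code, and it assumes the
viewer is simply the last MPI rank and that the render ranks share a
sub-communicator for the broadcast:

/* Simplified sketch of the current per-frame exchange (not my real code).
 * Assumptions: the viewer is the last MPI rank, the render nodes are the
 * rest, and the render ranks broadcast over their own sub-communicator.  */
#include <mpi.h>
#include <stdlib.h>

#define FRAME_BYTES (1024 * 1024)   /* ~1MB of result data per frame */
#define NFRAMES     100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int viewer  = size - 1;                 /* viewer/client process      */
    int master  = 0;                        /* rank 0 of the render nodes */
    int nrender = size - 1;
    int chunk   = FRAME_BYTES / nrender;    /* each node returns 1MB / N  */

    /* sub-communicator so the broadcast excludes the viewer */
    MPI_Comm render;
    MPI_Comm_split(MPI_COMM_WORLD, rank == viewer ? MPI_UNDEFINED : 0,
                   rank, &render);

    char *piece = malloc(chunk);
    char *frame = (rank == viewer) ? malloc((size_t)chunk * nrender) : NULL;

    for (int f = 0; f < NFRAMES; f++) {
        int frame_id = f;

        if (rank == viewer) {
            /* 1. viewer sends a single integer to rank 0 ...             */
            MPI_Send(&frame_id, 1, MPI_INT, master, 0, MPI_COMM_WORLD);

            /* 3. ... then receives 1MB/N from every render node; all of
             *    them start pushing data at essentially the same moment. */
            for (int r = 0; r < nrender; r++)
                MPI_Recv(frame + (size_t)r * chunk, chunk, MPI_BYTE,
                         r, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            if (rank == master)
                MPI_Recv(&frame_id, 1, MPI_INT, viewer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* 2. rank 0 broadcasts the 4-byte frame id to the renderers  */
            MPI_Bcast(&frame_id, 1, MPI_INT, 0, render);

            /* render this node's piece (rendering omitted here), then
             * push it to the viewer                                       */
            MPI_Send(piece, chunk, MPI_BYTE, viewer, 1, MPI_COMM_WORLD);
        }
    }

    free(piece);
    free(frame);
    if (render != MPI_COMM_NULL)
        MPI_Comm_free(&render);
    MPI_Finalize();
    return 0;
}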
> > To make sure there was not an issue with the MPI broadcast, I did one
> > test run with 5 nodes only sending back 4 bytes of data each. The result
> > was a RTT of less than 0.3 ms.
>
> isn't that kind of high?  a single ping-pong latency should be ~50 us -
> maybe I'm underestimating the latency of the broadcast itself.

This is quite a bit more than a single ping-pong. The viewer sends to the
master node (rank 0), then the master node broadcasts to all the other
nodes, and then all nodes send back to the viewer node. I don't know if
that still seems high.

> > One interesting pattern I noticed is that the hiccup frame RTTs, almost
> > without exception, fall into one of three ranges (approximately 50-60,
> > 200-210, and 250-260). Could this be related to exponential back-off?
>
> perhaps introduced by the switch, or perhaps by the fact that the bcast
> isn't implemented as an atomic (eth-level) broadcast.

But the bcast is always just sending 4 bytes (a single integer), and as
mentioned above no hiccups occur until the size of the final gather packets
(from all nodes to the viewer) is increased.

> > Tomorrow I will experiment with jumbo frames and flow control settings
> > (both of which the HP Procurve claims to support). If these do not solve
> > the problems I will start sifting through tcpdump.
>
> I would simply serialize the slaves' responses first.  the current design
> tries to trigger all the slaves to send results at once, which is simply
> not logical if you think about it, since any one slave can saturate
> the master's link.

I still have the feeling that the switch should be able to handle this more
efficiently, but since your idea is relatively simple to implement I will
give it a try and see what the performance is like; I have sketched below
what I understand the serialized version to look like. Thanks for your
input.
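If I have understood the suggestion correctly, the change would look
something like the following. Again this is only a sketch: the function
names and the extra request tag are made up, and it reuses the variables
from the sketch above.

#include <mpi.h>
#include <stddef.h>

/* viewer side: replaces the receive loop in the earlier sketch. The viewer
 * asks one render node at a time for its piece, so only one slave is ever
 * streaming into the viewer's link.                                       */
void gather_serialized(char *frame, int chunk, int nrender, int frame_id)
{
    for (int r = 0; r < nrender; r++) {
        /* small "send me your piece" request to a single slave ...        */
        MPI_Send(&frame_id, 1, MPI_INT, r, 2, MPI_COMM_WORLD);
        /* ... and wait for its full chunk before asking the next one      */
        MPI_Recv(frame + (size_t)r * chunk, chunk, MPI_BYTE,
                 r, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

/* render-node side: replaces the plain MPI_Send of the piece. Each node
 * renders as before but holds its result until it is explicitly asked.    */
void send_when_asked(char *piece, int chunk, int viewer)
{
    int request;
    MPI_Recv(&request, 1, MPI_INT, viewer, 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(piece, chunk, MPI_BYTE, viewer, 1, MPI_COMM_WORLD);
}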
> regards, mark hahn.