The machines are running the 2.6 kernel and I have confirmed that the max
TCP send/recv buffer sizes are 4MB (more than enough to store the full
512x512 image).
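For reference, those limits can be read straight out of /proc; a minimal C
sketch (not the actual test code; net.core.*mem_max caps what setsockopt()
may ask for, while net.ipv4.tcp_rmem/tcp_wmem hold TCP's autotuning limits):

  /* sketch: print the kernel's max socket buffer limits on a 2.6 box */
  #include <stdio.h>

  static long read_limit(const char *path)
  {
      long v = -1;
      FILE *f = fopen(path, "r");
      if (f) {
          if (fscanf(f, "%ld", &v) != 1)
              v = -1;
          fclose(f);
      }
      return v;
  }

  int main(void)
  {
      printf("net.core.rmem_max = %ld\n",
             read_limit("/proc/sys/net/core/rmem_max"));
      printf("net.core.wmem_max = %ld\n",
             read_limit("/proc/sys/net/core/wmem_max"));
      return 0;
  }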
the bandwidth-delay product in a lan is low enough to not need
this kind of tuning.
I loop with the client-side program sending a single integer to rank 0, then
rank 0 broadcasts this integer to the other nodes, and then all nodes send
back 1MB / N of data.
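For concreteness, a minimal sketch of that pattern (not the actual code;
FRAME_BYTES and the MPI_Gather call are assumptions):

  #include <mpi.h>

  #define FRAME_BYTES (1 << 20)   /* assuming the full frame is ~1MB */

  /* rank 0 broadcasts the request, then every rank returns its 1MB/N share */
  void collect_frame(int request, char *my_chunk, char *frame, MPI_Comm comm)
  {
      int nprocs;
      MPI_Comm_size(comm, &nprocs);
      int share = FRAME_BYTES / nprocs;

      /* all ranks learn which frame is wanted */
      MPI_Bcast(&request, 1, MPI_INT, 0, comm);

      /* every rank answers at once, so rank 0's single link is
         oversubscribed N:1 for the duration of the gather */
      MPI_Gather(my_chunk, share, MPI_BYTE, frame, share, MPI_BYTE, 0, comm);
  }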
hmm, that's a bit harsh, don't you think? why not have the rank0/master
ask each slave for its contribution sequentially? sure, it introduces a bit
of "dead air", but it's not as if two slaves can stream to a single master
at once anyway (each can saturate its link, therefore the master's link is
N-times overcommitted.)
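something like this, in plain point-to-point mpi (just a sketch; the tags
and the per-slave "go" message are made up):

  #include <mpi.h>

  #define FRAME_BYTES (1 << 20)
  #define TAG_GO   1
  #define TAG_DATA 2

  /* the master polls one slave at a time, so at most one stream is
     ever in flight toward the master's link */
  void collect_frame_serialized(int request, char *my_chunk, char *frame,
                                MPI_Comm comm)
  {
      int rank, nprocs;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nprocs);
      int share = FRAME_BYTES / nprocs;

      MPI_Bcast(&request, 1, MPI_INT, 0, comm);

      if (rank == 0) {
          for (int src = 1; src < nprocs; src++) {
              MPI_Send(&request, 1, MPI_INT, src, TAG_GO, comm);  /* "your turn" */
              MPI_Recv(frame + src * share, share, MPI_BYTE, src,
                       TAG_DATA, comm, MPI_STATUS_IGNORE);
          }
      } else {
          MPI_Recv(&request, 1, MPI_INT, 0, TAG_GO, comm, MPI_STATUS_IGNORE);
          MPI_Send(my_chunk, share, MPI_BYTE, 0, TAG_DATA, comm);
      }
  }

the only "dead air" is one small-message round trip per slave before its
chunk starts moving.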
To make sure there was not an issue with the MPI broadcast, I did one test
run with 5 nodes only sending back 4 bytes of data each. The result was an
RTT of less than 0.3 ms.
isn't that kind of high? a single ping-pong latency should be ~50 us -
maybe I'm underestimating the latency of the broadcast itself.
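one way to split the two is to time the phases separately; a sketch (the
function and buffer sizes are made up, and note that at rank 0 the bcast
can return before the last rank has the value, so this only bounds it):

  #include <mpi.h>
  #include <stdio.h>

  /* time the broadcast and the 4-byte-per-rank gather separately */
  void time_phases(MPI_Comm comm)
  {
      int rank, request = 0;
      char reply[4] = {0};
      char all[4 * 64];                 /* room for up to 64 ranks at rank 0 */
      MPI_Comm_rank(comm, &rank);

      MPI_Barrier(comm);                /* start everyone together */
      double t0 = MPI_Wtime();
      MPI_Bcast(&request, 1, MPI_INT, 0, comm);
      double t1 = MPI_Wtime();
      MPI_Gather(reply, 4, MPI_BYTE, all, 4, MPI_BYTE, 0, comm);
      double t2 = MPI_Wtime();

      if (rank == 0)
          printf("bcast %.1f us, gather %.1f us\n",
                 (t1 - t0) * 1e6, (t2 - t1) * 1e6);
  }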
One interesting pattern I noticed is that the hiccup frame RTTs, almost
without exception, fall into one of three ranges (approximately 50-60 ms,
200-210 ms, and 250-260 ms). Could this be related to exponential back-off?
perhaps introduced by the switch, or perhaps by the fact that the bcast
isn't implemented as an atomic (eth-level) broadcast.
Tomorrow I will experiment with jumbo frames and flow control settings (both
of which the HP ProCurve claims to support). If these do not solve the
problems, I will start sifting through tcpdump output.
I would simply serialize the slaves' responses first. the current design
tries to trigger all the slaves to send results at once, which is simply
not logical if you think about it, since any one slave can saturate
the master's link.
regards, mark hahn.