As has been pointed out to me offline, my numbers may be more
pessimistic than needed, in part due to pipelining and other effects. If
my numbers were the result of a correct analysis, the most you would
see from a gigabit link would be about 37 MB/s for 1500 byte packets.
This is obviously not the case, so treat this as a "worst case"
analysis (and I am going to go back and review what I seem to have
dropped from the TCP bits).
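As a rough check on where the 37 MB/s ceiling comes from, the sketch below divides the packet size by the per-packet time used in the quoted analysis (about 11us on the wire plus about 30us of latency). The function name is made up for illustration, and the strictly one-packet-at-a-time model is exactly the pessimistic assumption discussed above, not a measurement.

```python
def serialized_throughput_mb_s(packet_bytes, wire_us, latency_us):
    """Throughput in MB/s if each packet pays wire time plus the full
    latency before the next packet may start (no pipelining)."""
    per_packet_s = (wire_us + latency_us) * 1e-6
    return packet_bytes / per_packet_s / 1e6

# Figures from the quoted message: 1500-byte packets, ~11us wire, ~30us latency.
print(round(serialized_throughput_mb_s(1500, 11.0, 30.0), 1))
```

A real gigabit link sustains far more than this because packets overlap in flight, which is why the figure should be read as a lower bound on what is achievable.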
Joe
Joe Landman wrote:
Hi Brendan:
Brendan Moloney wrote:
> I have a cluster of 8 Linux machines connected with gigabit
> ethernet (full duplex) to a HP Procurve 2848 switch. I am using the
> machines to do interactive distributed rendering. I have noticed that
> the final gather stage (where the intermediate images from the render
> nodes are sent back to the viewing node) has "hiccups" in the
> performance. These
How are they sent? NFS? Sockets? ...
> hiccups occur with as few as two render nodes, and become more common
> as I add more render nodes. With a 512x512 image the final gather
> usually takes a few milliseconds for each frame, but when the hiccups
> occur it is more like 200+ milliseconds.
Is this "real time" rendering, so that frame rate is the most important
aspect?
> Since it is a full duplex switched network, there should not be any
> collisions happening. Since the image is less than 1 MB total, I don't
There could be blocking ... if one unit grabs the single network pipe
of the display node while another node tries to send data, then the
late node will back off (well, with TCP it will) in a pre-determined manner.
> think I am saturating the switch. I have checked the contents of
> /sbin/ifconfig and there are zero erroneous packets being reported.
> At this
You wouldn't see it there. It would be on the switch, and even then the
switch wouldn't term it a collision. It is a switch behaving normally.
> point I am really at a loss as to what is causing this. Any input on
> things to check would be greatly appreciated.
I assume you have a single gigabit link from the display node to the switch.
As you scale up the number of render nodes, you notice more of these
"hiccups" scaling about linearly with the number of nodes.
This suggests resource contention. Each 512x512 image would be split
into about 175 1500-byte packets. This assumes 8 bit images (one byte
per pixel). If you are using 8 bits per color, 3 colors and an alpha
channel, then this is ~700 packets. Each 1500 byte packet takes about
11us to transmit, and has a non-trivial latency associated with it. I
will estimate the latency at 30us (this is switch latency of ~5us +
network stack latency on each side of about 12.5us). So for each
packet, you need about 41us to transfer it. If you have 8 bit images,
then this corresponds to about 7.2 ms per image (175 x 41us). There may
be some other caching effects that I am missing or have mis-computed.
For 32 bits (3x 8 bit color channels + 1 alpha channel), this is
looking like 28.8 ms for each image. Best case you could do with this
is about 34.7 frames per second.
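The arithmetic in this paragraph can be written out as a small sketch. The function name and parameters are illustrative only. The image size comes from the original question, the 30us latency is the estimate above, and the link rate is an assumption left as a parameter. At exactly 10^9 bit/s a 1500 byte packet takes 12us on the wire. The "about 11us" used above matches a rate of 2^30 bit/s instead.

```python
import math

def frame_time_ms(width, height, bytes_per_pixel, packet_bytes,
                  link_bits_per_s=1e9, latency_us=30.0):
    """Time to send one image if packets go strictly one after another,
    each paying wire time plus a fixed latency (worst case, no pipelining).
    Every packet, including the last, is treated as full size."""
    image_bytes = width * height * bytes_per_pixel
    packets = math.ceil(image_bytes / packet_bytes)
    wire_us = packet_bytes * 8 / link_bits_per_s * 1e6
    return packets, packets * (wire_us + latency_us) / 1000.0

for bpp in (1, 4):  # 8 bit image, then 3 color channels + alpha
    packets, ms = frame_time_ms(512, 512, bpp, 1500)
    print(f"{bpp} byte/pixel: {packets} packets, {ms:.2f} ms, "
          f"{1000.0 / ms:.1f} frames/s")
```

With the 10^9 bit/s default this gives 175 and 700 packets, and roughly 7.35 ms and 29.4 ms per image, slightly above the 7.2 ms and 28.8 ms quoted because of the 12us versus 11us wire time. Passing `link_bits_per_s=2**30` brings the result close to the figures in the text.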
If, on the other hand, you used jumbo frames with 9000 byte packets,
you would need 30 packets to transfer each image. Each would require
67.1us to move, plus the same 30us of latency, for 97.1us per packet.
For 30 packets, this is 2.9ms. For the 32 bit version as indicated
previously (3x 8 bit color channels, and one alpha channel) this would
be about 11.6ms, or 85.9 frames per second.
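Splitting the per-image time into wire time and latency shows where jumbo frames gain under this model. The sketch below is an illustration with made-up names. It assumes a 10^9 bit/s link, the 30us per-packet latency estimated earlier, and full-size packets sent strictly one after another.

```python
import math

def image_time_breakdown(image_bytes, packet_bytes,
                         link_bits_per_s=1e9, latency_us=30.0):
    """Return (packets, wire_ms, latency_ms) for one image sent as
    back-to-back full-size packets with a fixed per-packet latency."""
    packets = math.ceil(image_bytes / packet_bytes)
    wire_ms = packets * packet_bytes * 8 / link_bits_per_s * 1e3
    latency_ms = packets * latency_us / 1e3
    return packets, wire_ms, latency_ms

image_bytes = 512 * 512  # 8 bit 512x512 image from the original question
for mtu in (1500, 9000):
    packets, wire_ms, latency_ms = image_time_breakdown(image_bytes, mtu)
    print(f"{mtu:5d} B: {packets:3d} packets, wire {wire_ms:.2f} ms, "
          f"latency {latency_ms:.2f} ms, total {wire_ms + latency_ms:.2f} ms")
```

Under these assumptions the wire time stays near 2.1 ms in both cases, while the latency term falls from about 5.25 ms (175 packets) to about 0.9 ms (30 packets). The totals differ a little from the figures in the text because 10^9 bit/s gives 72us rather than 67.1us per 9000 byte packet.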
Based on this, I would suggest seeing if changing the MTU to 9000 helps. Run
    ifconfig eth0 mtu 9000
on all your nodes (every one).
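Since the setting has to match on every node, it may help to confirm what each machine actually reports after the change. The sketch below reads the value Linux exposes in sysfs at /sys/class/net/<iface>/mtu. The helper name `read_mtu` is hypothetical, and `eth0` is simply the interface name from the command above. Yours may differ.

```python
from pathlib import Path

def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU Linux reports for `iface`, read from sysfs."""
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())

if __name__ == "__main__":
    iface = "eth0"  # interface name taken from the command above
    try:
        mtu = read_mtu(iface)
    except OSError as exc:
        print(f"could not read MTU for {iface}: {exc}")
    else:
        note = "(9000 or more)" if mtu >= 9000 else "(below 9000)"
        print(f"{iface}: mtu {mtu} {note}")
```

Running this on each node (over ssh, for example) gives a quick way to spot a machine that was missed.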
The argument for this is that you have less latency to pay for, even
though it takes longer to transfer the payload.
Another possibility is channel bonding on your display node.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf