Re: [Beowulf] SATA II - PXE+NFS - diskless compute nodes

Simon Kelley Thu, 14 Dec 2006 15:19:10 -0800

Donald Becker wrote:


I'm not quite following here: It seems like you might be advocating
retransmits every half second. I'm currently doing classical exponential
backoff, 1 second delay, then two, then four etc. Will that bite me?

Where are you doing exponential back-off?

Re-transmits in the TFTP server: send a block and await the
corresponding ACK; if it doesn't arrive within the timeout, re-send.
This is needed to recover from lost data packets; client retries only
recover from lost ACKs (at least they do in implementations which have
been immunised against sorcerer's-apprentice syndrome).
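
For concreteness, the 1, 2, 4 second schedule mentioned above can be sketched in a few lines of Python. This is only an illustration: the name `backoff_delays` and the optional `max_delay` cap are assumptions made for this sketch, not taken from any actual TFTP server.

```python
from itertools import islice

def backoff_delays(base=1.0, factor=2.0, max_delay=None):
    """Yield successive retransmit timeouts: base, base*factor, ...

    max_delay is a hypothetical cap on the timeout; None means no cap.
    """
    delay = base
    while True:
        yield delay
        delay *= factor
        if max_delay is not None:
            delay = min(delay, max_delay)

# First four timeouts, and the total time spent waiting across them.
delays = list(islice(backoff_delays(), 4))
print(delays, sum(delays))
```

With the defaults, four lost blocks in a row already cost fifteen seconds of waiting, which is the cost the rest of this thread is arguing about.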

The TFTP client will/should/might do a retry every second.  (Background:
TFTP uses "ACK" of the previous packet to mean "send the next one".  The
only way to detect that this is a retry is timing.) The client might do a
re-ARP first.  In corner cases it might not reply to ARP itself.

[[ Step up on the soapbox. ]]

What idiot thought that exponential backoff was a good idea?
Exponential backoff doesn't make sense where your base time period is a
whole second and you can't tell if the reason for no response is
failure, busy network or no one listening.

My guess is that they were just copying Ethernet, where modified,
randomized exponential backoff is what makes it magically good.
Exponential backoff makes sense at the microsecond level, where you have
a collision domain and potentially 10,000 hosts on a shared ether.  Even
there the idea of "carrier sense" or 'is the network busy' is what
enables Ethernet to work at 98+% utilization rather than the 18% or 37%
theoretical of Aloha Net.  (Key difference: deaf transmitter.)
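
The 18% and 37% figures are the textbook maximum throughputs of pure and slotted ALOHA. A small Python check, assuming the standard Poisson-arrival model (throughput S = G*exp(-2G) for pure ALOHA and S = G*exp(-G) for slotted, where G is the offered load); the function names are made up for this sketch:

```python
import math

def pure_aloha(G):
    """Throughput of pure ALOHA at offered load G (frames per frame time)."""
    return G * math.exp(-2 * G)

def slotted_aloha(G):
    """Throughput of slotted ALOHA at offered load G."""
    return G * math.exp(-G)

# The maxima sit at G = 0.5 (pure) and G = 1.0 (slotted).
print(round(pure_aloha(0.5), 3), round(slotted_aloha(1.0), 3))  # 0.184 0.368
```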

What usually happens with DHCP and PXE is that the first packet is used
getting the NIC to transmit correctly.  The second packet is used to get
the switch to start passing traffic.  The third packet gets through but we
are already well into the exponential backoff.

PXE would be much better and more reliable if it started out
transmitting a burst of four DHCP packets evenly spaced in the first
second, then falling back to once per second.  If there is a concern
about DHCP being a high percentage of traffic in huge installations
running 10baseT, tell them to buy a server. Or, like, you know, a
router.  Because later the ARP traffic alone will dwarf a few DHCP
broadcasts.
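
To make the difference concrete, here is a rough Python comparison of when packets would leave the client under the two schemes. It is a sketch under stated assumptions: the exponential schedule uses the 1, 2, 4 second delays mentioned earlier in the thread (not the timeouts of any particular PXE ROM), the burst is taken to be four packets 0.25 s apart, and "once per second" is counted from the last burst packet. The function names are invented for the example.

```python
def exponential_send_times(count, base=1.0):
    """Send times when each wait doubles: 0, base, 3*base, 7*base, ..."""
    times, t, delay = [], 0.0, base
    for _ in range(count):
        times.append(t)
        t += delay
        delay *= 2
    return times

def burst_send_times(count, burst=4, window=1.0, interval=1.0):
    """Send `burst` packets evenly spaced inside `window`, then one per `interval`."""
    step = window / burst
    times = []
    for i in range(count):
        if i < burst:
            times.append(i * step)
        else:
            # Assumption: the steady interval is measured from the last burst packet.
            times.append((burst - 1) * step + (i - burst + 1) * interval)
    return times

print(exponential_send_times(5))  # [0.0, 1.0, 3.0, 7.0, 15.0]
print(burst_send_times(6))        # [0.0, 0.25, 0.5, 0.75, 1.75, 2.75]
```

If the first two packets are lost to NIC and switch start-up, the third goes out at 0.5 s with the burst schedule and at 3 s with the doubling one.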

It's probably worth differentiating DHCP and TFTP here. I guess the
reason for exponential backoff is to avoid congestion collapse as the
ratio of bits-on-the-wire to useful work decreases. By the time a host
is doing TFTP the network path should be established, so bursting
packets shouldn't be needed. Maybe delaying backoff would make sense.

I'm doing round-robin, but I don't see how to throttle active
connections: do I need to do that, or just limit total bandwidth?



Yes, you need to throttle active TFTP connections.  The clients
currently winning can turn around a next-packet request really quickly.
If a few get in lock step, the server will have the next chunk of the
file warm in the cache.  This is the start of locking out the first
loser.

You can't just let the ACKs queue up in the socket as a substitute for
deferring responses either.  You have to pull them out ASAP and mark
that client as needing a response.  This doesn't cost very much.   You
need to keep the client state structure anyway.  This is just one more
bit, plus updating the timeval that you should be keeping anyway.
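
One way to picture the per-client state described above is the following Python sketch. It is not how any existing TFTP server does it; the `Client` and `Throttle` names, the `min_gap` and `budget` parameters, and the trick of modelling the initial read request as an ACK of block 0 are all assumptions made for illustration. ACKs are recorded immediately, and replies are sent later in round-robin order, with a minimum gap per client.

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Client:
    sent_block: int = 0                # last data block sent (0 = nothing yet)
    needs_response: bool = False       # the "one more bit"
    last_sent: float = float("-inf")   # stands in for the timeval

class Throttle:
    def __init__(self, min_gap):
        self.min_gap = min_gap         # minimum seconds between blocks per client
        self.clients = OrderedDict()   # insertion order doubles as service order

    def on_ack(self, addr, block):
        """Pull an ACK off the socket right away; only mark the client."""
        c = self.clients.setdefault(addr, Client())
        if block == c.sent_block:      # stale or duplicate ACKs are ignored
            c.needs_response = True

    def service(self, now, budget):
        """Send at most `budget` blocks; served clients go to the back."""
        sent = []
        for addr in list(self.clients):
            if len(sent) >= budget:
                break
            c = self.clients[addr]
            if c.needs_response and now - c.last_sent >= self.min_gap:
                c.sent_block += 1
                c.needs_response = False
                c.last_sent = now
                self.clients.move_to_end(addr)
                sent.append((addr, c.sent_block))
        return sent

t = Throttle(min_gap=0.01)
t.on_ack("a", 0)
t.on_ack("b", 0)
print(t.service(now=0.0, budget=1))    # [('a', 1)]
t.on_ack("a", 1)                       # "a" turns its ACK around quickly
print(t.service(now=0.001, budget=1))  # [('b', 1)]
```

Even though "a" acknowledges first, "b" gets the next slot, and "a" is not served again until `min_gap` has passed.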

All true. I'll experiment with some throttling approaches.

Cheers,

Simon.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
