On Thu, 14 Dec 2006, Simon Kelley wrote:
> Donald Becker wrote:
> > I should repeat this: forking a dozen processes sounds like a good
> > idea. Thinking about forking a thousand (we plan every element to
> > scale to "at least 1000") makes "1" seem like a much better idea.
> >
> > With one continuously running server, the coding task is harder. You
> > can't leak memory. You can't leak file descriptors. You have to check
> > for updated/modified files. You can't block on anything. You have to
> > re-read your config file and re-open your control sockets on SIGHUP
> > rather than just exiting. You should show/checkpoint the current
> > state on SIGUSR1.
>
> All that stuff is there, and has been bedded down over several years.
> The TFTP code is an additional 500 lines.
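The single-process discipline listed above (never block, re-read config on SIGHUP, checkpoint on SIGUSR1) reduces to a small, familiar pattern. Here is a minimal sketch of mine in Python, assuming a select()-based loop; none of it is dnsmasq's actual code, and the function names are invented:

```python
import os
import select
import signal

# Flags set from signal context; the main loop acts on them, so the
# handlers themselves stay trivial.
reload_config = False
dump_state = False

def on_sighup(signum, frame):
    global reload_config
    reload_config = True

def on_sigusr1(signum, frame):
    global dump_state
    dump_state = True

signal.signal(signal.SIGHUP, on_sighup)
signal.signal(signal.SIGUSR1, on_sigusr1)

def one_iteration(timeout=0.0):
    """One turn of the event loop: act on signals, then poll sockets."""
    global reload_config, dump_state
    if reload_config:
        reload_config = False
        # Re-read the config file and re-open control sockets here,
        # instead of exiting and restarting.
        return "reloaded"
    if dump_state:
        dump_state = False
        # Write the current lease/transfer state somewhere inspectable.
        return "dumped"
    # select() with a timeout instead of blocking reads: the server must
    # never block waiting on any one client.
    select.select([], [], [], timeout)
    return "idle"
```

The point of the flag-plus-loop split is that the real work (re-reading files, re-opening sockets) happens at a known point in the loop, never inside a signal handler.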
It's not difficult to write a TFTP server. (The "trivial" in the name is
a hint for those who haven't tried it.) It's difficult to write a
reliable, scalable one. But you have a head start.

> > Once you do have all of that written, it's now possible, even easy,
> > to count how many bytes and packets were sent in the last timer tick
> > and to check that every client asked for and received a packet in
> > the last half second. Combine the two and you can smoothly switch
> > from bandwidth control to round-robin responses, then to slightly
> > deferring DHCP responses.
>
> I'm not quite following here: it seems like you might be advocating
> retransmits every half second. I'm currently doing classical
> exponential backoff: 1 second delay, then two, then four, etc. Will
> that bite me?

Where are you doing exponential back-off? For the TFTP client? The TFTP
client will/should/might do a retry every second. (Background: TFTP uses
the "ACK" of the previous packet to mean "send the next one". The only
way to detect that an ACK is a retry is timing.) The client might do a
re-ARP first. In corner cases it might not reply to ARP itself.

[[ Step up on the soapbox. ]]

What idiot thought that exponential backoff was a good idea? Exponential
backoff doesn't make sense where your base time period is a whole second
and you can't tell if the reason for no response is failure, a busy
network, or no one listening. My guess is that they were just copying
Ethernet, where modified, randomized exponential backoff is what makes
it magically good. Exponential backoff makes sense at the microsecond
level, where you have a collision domain and potentially 10,000 hosts on
a shared ether. Even there, the idea of "carrier sense" or "is the
network busy" is what enables Ethernet to work at 98+% utilization
rather than the 18% or 37% theoretical maximum of Aloha Net. (Key
difference: deaf transmitter.)

What usually happens with DHCP and PXE is that the first packet is used
getting the NIC to transmit correctly.
The second packet is used to get the switch to start passing traffic.
The third packet gets through, but we are already well into the
exponential fallback. PXE would be much better and more reliable if it
started out transmitting a burst of four DHCP packets evenly spaced over
the first second, then fell back to once per second. If there is a
concern about DHCP being a high percentage of traffic in huge
installations running 10baseT, tell them to buy a server. Or, like, you
know, a router. Because later the ARP traffic alone will dwarf a few
DHCP broadcasts.

> I'm doing round-robin, but I don't see how to throttle active
> connections: do I need to do that, or just limit total bandwidth?

Yes, you need to throttle active TFTP connections. The clients currently
winning can turn around a next-packet request really quickly. If a few
get in lock step, the server will have the next chunk of the file warm
in the cache. This is the start of locking out the first loser.

You can't just let the ACKs queue up in the socket as a substitute for
deferring responses either. You have to pull them out ASAP and mark that
client as needing a response. This doesn't cost very much. You need to
keep the client state structure anyway. This is just one more bit, plus
updating the timeval that you should be keeping anyway.

> >> It's maybe worth giving a bit of background here: dnsmasq is a
> >> lightweight DNS forwarder and DHCP server. Think of it as being
> >> equivalent to BIND and ISC DHCP, with BIND mainly in forward-only
> >> mode but doing dynamic DNS and a bit of authoritative DNS too.
> >
> > One of the things we have been lacking in Scyld has been an external
> > DNS service for compute nodes. For cluster-internal name look-ups we
> > developed BeoNSS.
>
> Dnsmasq is worth a look.

We likely can't leverage anything there. We already have a name system
in BeoNSS. We just need the gateway from this NSS to DNS queries.
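Going back to the throttling question above: the "one more bit, plus a timeval" scheme can be sketched roughly as follows. This is my own Python illustration, not dnsmasq code; BLOCK_SIZE and the per-tick byte budget are invented parameters. Drain every ACK immediately, set the pending bit, and let a per-tick scheduler serve pending clients round-robin within a bandwidth budget:

```python
from collections import OrderedDict

BLOCK_SIZE = 512  # classic TFTP data-block size

class Client:
    def __init__(self):
        self.pending = False   # the one extra bit: a DATA block is owed
        self.last_ack = 0.0    # the timeval you should be keeping anyway

def drain_ack(clients, addr, now):
    """Pull an ACK out of the socket immediately and just mark the client.

    Nothing is sent from here; deciding *when* to send is left to the
    scheduler, so a fast client can't turn the server into its private
    file pump.
    """
    c = clients.setdefault(addr, Client())
    c.pending = True
    c.last_ack = now

def schedule(clients, byte_budget):
    """Serve pending clients round-robin within a per-tick byte budget.

    Returns the clients to send a DATA block to this tick.  Anyone who
    misses the budget keeps the pending bit and goes first next tick,
    which is what keeps the fast clients from locking out the rest.
    """
    to_send = []
    for addr, c in list(clients.items()):
        if byte_budget < BLOCK_SIZE:
            break
        if c.pending:
            c.pending = False
            byte_budget -= BLOCK_SIZE
            to_send.append(addr)
            clients.move_to_end(addr)  # rotate served clients to the back
    return to_send
```

Because served clients rotate to the back and unserved ones keep their place, a client that misses the budget this tick is automatically first in line on the next one.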
> > BeoNSS uses the linear address assignment of compute nodes to
> > calculate the name or IP address, e.g. "Node23" is the IP address
> > of Node0 + 23. So BeoNSS depends on the assignment policy of
> > the PXE server (1).
>
> To do that with dnsmasq you'll have to nail down the IP address
> associated with every MAC address. ..
> standard.) OTOH if you use dnsmasq to provide your name service you
> might not need the linear assignment.

I consider naming and numbering an important detail. The freedom to
assign arbitrary names and IP addresses is a useful flexibility in a
workstation environment. But for a compute room or cluster you want
regular names and automatic-but-persistent IP addresses.

We assign compute nodes a small integer node number the first time we
accept them into the cluster. This is the node's persistent ID unless
the administrator manually changes it.

We used to allow node specialization based on MAC address as well as
node number. The idea was that the MAC address identified the specific
machine hardware (e.g. extra disks or a frame buffer, server #6 of 16 in
a PVFS array), while the node number might be used to specialize for a
logical purpose. What we quickly found was that mostly-permanent node
number assignment was a useful simplification. We deprecated MAC-based
specialization in favor of the node number being used for both physical
and logical specialization.

Just like you don't want your home address to change when a house down
the street burns down, you don't want node IP addresses or node
numbering to change. But you want automatic numbering when the street is
extended or a new house is built on a vacant lot, with a manual override
saying this house replaces the one that burnt down.

[[ Do I get extra points for not using an automotive analogy? I can
throw them away with "You don't care about the cylinder numbering in
your car. But it's useful to have them numbered when you replace the
spark plug cables."
]]

> > (1) This leads to one of the many details that you have to get
> > right. The PXE server always assigns a temporary IP address to new
> > nodes. Once a node has booted and passed tests, we then assign it a
> > permanent node number and IP address. Assigning short-lease IP
> > addresses then changing them a few seconds later requires tight,
> > race-free integration with the DHCP server and ARP tables. That's
> > easy with a unified server, difficult with a script around ISC DHCP.
>
> Is this a manifestation of the with-and-without-client-id problem? PXE
> sends a client-id, but the OS doesn't, or vice versa. Dnsmasq has
> nailed down rules which work in most cases of this, mainly by
> trial-and-error.

No, it's a different issue. PXE does have UUIDs, a universally unique ID
that is distinct from MAC addresses. If you implement from the spec, you
can use the UUID to pass out IP addresses and avoid the messiness of
using the MAC address.

I know I have the first machine built with the feature. It has a UUID of
all zeros :-O. Then I have a whole bunch of other machines that must
have been built for other universes, because they have exactly the same
all-zeros ID.

Even when the UUID is distinct, it doesn't uniquely ID the machine.
Different NICs on the same machine have different UUIDs, meaning you
cannot detect that it's the same machine you got a request from a few
seconds ago. Bottom line: UUIDs are wildly useless.

We address the multi-NIC case, along with a few others, by only
assigning a persistent node number after the machine boots and runs a
test program. The test program is elegantly simple: a Linux-based DHCP
client. The request packets have an option field listing all of the
machine's MAC addresses. (BTW, this is the same DHCP client code
originally written to do PXE scalability tests.)
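The two pieces above, linear Node0+n addressing and the all-MACs report from the test client, combine into a small accept flow. The sketch below is my own toy illustration, not Scyld's code: the base address, table layout, and function names are all invented. Any MAC overlap means "machine we have already numbered", no matter which NIC it booted from; UUIDs are deliberately not consulted.

```python
import ipaddress

NODE0_IP = ipaddress.IPv4Address("10.54.0.100")  # hypothetical Node0 address

def node_ip(n):
    """Linear assignment: Node-n's IP address is Node0's address plus n."""
    return NODE0_IP + n

def accept_node(known, reported_macs):
    """Map a booting machine to a persistent node number.

    `known` maps node number -> set of MACs reported the last time that
    node ran the test client.  Any overlap means this is a machine we
    already numbered, regardless of which NIC it booted from; otherwise
    it gets the next free number (the manual-override path is omitted).
    """
    reported = set(reported_macs)
    for node, macs in known.items():
        if macs & reported:
            macs |= reported       # learn any NICs we hadn't seen yet
            return node
    node = max(known) + 1 if known else 0
    known[node] = reported
    return node
```

Numbering stays stable across reboots and NIC swaps exactly because the match is on the MAC *set*, not on any single identifier.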
-- 
Donald Becker                          [EMAIL PROTECTED]
Scyld Software                         Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220          www.scyld.com
Annapolis MD 21403                     410-990-9993

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf