On Tue, 12 Dec 2006, Simon Kelley wrote:
> Joe Landman wrote:
> >>> I would hazard that any DHCP/PXE type install server would struggle
> >>> with 2000 requests (yes - you arrange the power switching and/or
> >>> reboots to stagger at N second intervals).
Those that have talked to me about this topic know that it's a hot-button
for me.

The limit with the "traditional" approach, the ISC DHCP server with one of
the three common TFTP servers, is about 40 machines before you risk losing
machines during a boot. With 100 machines you are likely to lose 2-5 during
a typical power-restore cycle when all machines boot simultaneously. The
actual node-count limit is strongly dependent on the exact hardware (e.g.
the characteristics of the Ethernet switch) and the size of the boot image
(larger is much worse than you would expect).

Staggering node power-up is a hack to work around the limit. You can build
a lot of complexity into doing it "right", but you are still rolling the
dice overall. It's better to build a reliable boot system than to build a
complex system around known unreliability.

The right solution is to build a smart, integrated PXE server that
understands the bugs and characteristics of PXE. I wrote one a few years
ago and understand many of the problems. It's clear to me that no matter
how you hack up the ISC DHCP server, you won't end up with a good PXE
server. (Read that carefully: yes, it's a great DHCP server; no, it's not
good for PXE.)

> > fwiw: we use dnsmasq to serve dhcp and handle pxe booting. It does a
> > marvelous job of both, and is far easier to configure (e.g. it is less
> > fussy) than dhcpd.
>
> Joe, you might like to know that the next release of dnsmasq includes a
> TFTP server so that it can do the whole job. The process model for the
> TFTP implementation should be well suited to booting many nodes at once
> because it multiplexes all the connections on the same process. My guess
> is that will work better than having inetd fork 2000 copies of tftpd,
> which is what would happen with traditional TFTP servers.

Yup, that's a good start. It's one of the many things you have to do. You
are already far ahead of the "standard" approach. Don't forget flow and
bandwidth control, ARP table stuffing and clean-up, state reporting, etc.
Oh, and you'll find out about the PXE bug that results in a zero-length
filename... expect it.

> For ultimate scalability, I guess the solution is to use multicast-TFTP.
> I know that support for that is included in the PXE spec, but I've never
> tried to implement it. Based on prior experience of PXE ROMs, the chance
> of finding a sufficiently bug-free implementation of mtftp there must be
> fairly low.

This is a good example of why PXE is not just DHCP+TFTP. The multicast
TFTP in PXE is not standard multicast TFTP: the DHCP response specifies
the multicast group to join, rather than negotiating it as per RFC 2090.
That means multicast requires communication between the DHCP and TFTP
sections.

> > Likely with dhcpd, not sure how many dnsmasq can handle, but we have
> > done 36 at a time to do system checking. No problems with it.

As part of writing the server I wrote DHCP and TFTP clients to simulate
high-node-count boots. But the harshest test was old RLX systems: each of
the 24 blades had three NICs, but could only boot off the NIC connected to
the internal 100base repeater/hub. Plus the blade BIOS had a good selection
of PXE bugs.

Another good test is booting Itaniums (really DHCP+TFTP, not PXE). They
have a 7MB kernel and a similarly large initial ramdisk. Forget to strip
off the kernel symbols and you are looking at 70MB over TFTP. (But they
extend the block index from 16 to 64 bits, allowing you to start a transfer
that will take until the heat death of the universe to finish! Really, 32
bits is sometimes more than enough, especially when extending a crude
protocol that should have been forgotten long ago.)
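To put rough numbers on that, here is the back-of-the-envelope arithmetic
as a small Python sketch. The only protocol constant it uses is the
512-byte default TFTP block size; the 1 ms per-block round trip is purely
an assumed figure to give a sense of scale, not a measured value.

    # Back-of-the-envelope TFTP numbers. Only the 512-byte default block
    # size comes from the protocol; the 1 ms per-block round trip below
    # is an assumption chosen purely for scale.

    BLOCK = 512                              # default TFTP data block size

    def blocks_needed(image_bytes):
        """Number of TFTP data blocks for an image of the given size."""
        return -(-image_bytes // BLOCK)      # ceiling division

    kernel = 70 * 2**20                      # ~70MB unstripped kernel image
    print(blocks_needed(kernel))             # 143360 blocks
    print(blocks_needed(kernel) / 0xFFFF)    # ~2.2 wraps of a 16-bit counter

    # Largest image addressable with a given block-number width:
    for bits in (16, 32, 64):
        print(bits, "bits ->", (2**bits - 1) * BLOCK / 2**30, "GiB")
    # 16 bits -> ~0.03 GiB, 32 bits -> ~2048 GiB, 64 bits -> ~8.8e12 GiB

    # Lockstep transfer time for a full 64-bit transfer, one block per
    # (assumed) 1 ms round trip:
    years = 2**64 * 1e-3 / (365 * 24 * 3600)
    print(round(years / 1e6), "million years")   # roughly 585 million years

In other words, a 16-bit block counter tops out around 32MB, a 70MB kernel
wraps it a couple of times over, and a full 64-bit lockstep transfer would
comfortably outlive us all.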
> Dnsmasq will handle DHCP for thousands of clients on reasonably meaty
> hardware. The only rate-limiting step is a 2-3 second timeout while
> newly-allocated addresses are "ping"ed to check that they are not in
> use. That check is optional, and skipped automatically under heavy load,
> so a large number of clients is no problem.
>
> Cheers,
> Simon.

-- 
Donald Becker                           [EMAIL PROTECTED]
Scyld Software                          Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220           www.scyld.com
Annapolis MD 21403                      410-990-9993

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf