On Sat, 19 Apr 2008, Jan Heichler wrote:

> BC> But then I changed my mind when I started to hear what a great
> BC> feature it is to have several nodes booting and installing the OS
> BC> in the same 50 minutes (yes, minutes!) that a single node takes,
> BC> due to a wonderful feature called multicast.
>
> 50 minutes for a single node is of course unacceptable. 50 Minutes for
> 256 nodes is okay i think.
With Scyld we do dynamic provisioning and "diskless administration", so all run-time elements come from a master with each boot. That means we put a great deal of effort into making "installation" fast.

We have had releases that take 0.750 seconds before a compute node is ready to accept its first job. Yes, that's under a second to start a kernel, activate the network, set up a connection to the cluster master, and configure an application environment. (Of course that's ignoring the time the BIOS spends counting memory and PXE's two-second delay. And that node hardware had no disks and few devices, the kernel was 2.4, which started faster, and every software setting was left at the default, e.g. no mounted filesystems or extra services. A more typical start-up time is 5-15 seconds.)

But in my experience, even 10 seconds is too long for a scalable cluster architecture. Why? Because 100 or 1000 nodes is a really big number. Consider that an untuned master/boot-server spends about 20-25% of its focus on that node during the 10 seconds. Or, just for approximation, we can boot 4-5 nodes at once. That means it takes 3-5 minutes to bring 100 nodes up, and it's unbearably long to bring 1000 nodes up without auxiliary servers.

That's why we redesigned part of our boot system when booting started to take 10 seconds. And it's why I consider full installation to be unworkable for large clusters, especially when re-installation is considered to be part of cluster administration.

> But i doubt that it scales that well. Even Multicast packages get lost -
> needs retransmission etc.

This pushed the hot button that really triggered my response.

We have multicast options in many parts of our system. But they are always turned off. Multicast is a parlor trick, like balancing 10 plates on your nose while riding a unicycle. It makes for a great show, but you aren't served that way at your local diner.

We made the mistake once... a single release... never to be repeated... of turning on multicast by default. It worked for us, with small unmanaged switches in the test lab. It broke when we sent the release to customers. And it broke in a different way for each configuration.

The problem with multicast isn't how well it performs when it works, it's what happens when things go wrong.

Designing a multicast protocol requires knowing the characteristics of the transmission media. What happens when a packet is lost? Is it lost for everyone? Do you lose packets one at a time or in bursts? Do you lose them in small bursts or long bursts? Are losses equally spaced or randomly distributed? Switches make different choices about discarding packets when overloaded. (About the only common choice they make is that multicast packets get tossed first.) They change loss behavior as the load changes. Cascading switches, including multiple switch chips in a single box, multiplies the number of ways they can behave.

The more you look at the problem, the more you understand that designing even simple multicast protocols is very similar to designing error-correcting codes (a very difficult problem), and then we toss complexity on top of that: out-of-order packets, feedback-generated losses (where sending "please resend packet N" causes more loss), etc.

Hmmm, this is probably more readable if I stick a list in. A few problems with multicast bulk transfer are:

 - It slows the protocol to TFTP speed.
 - It discards the ability to use TCP transmit offload hardware and software fast-path receive.
 - Many switches prefer to drop multicast packets, presuming that it is low-priority traffic. This is especially true when multicast is handled by software on the switch, with the embedded CPU sized assuming a typical environment with very little multicast traffic.
 - Multicast packet drops may be pattern-based, leaving some transfers persistently incomplete.
 - Multicast filtering is imperfect on most host NICs, resulting in a hidden CPU cost and unpredictable performance on machines not participating in the transfer.
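To make the "what happens when things go wrong" point concrete, here is a toy back-of-the-envelope simulation, a sketch rather than any real installer's protocol. It pushes a fixed number of packets to a set of receivers, each round re-multicasting whatever anyone still reports missing, and compares two made-up loss models (independent 1% loss vs. Gilbert-style bursts). All of the numbers, names, and the resend-everything-missing scheme are illustrative assumptions, and it deliberately ignores the ugly parts listed above (NACK implosion, pattern-based drops, NIC filtering costs).

#!/usr/bin/env python3
# Toy model of a NACK-style multicast bulk transfer under packet loss.
# Not any real installer's protocol -- the loss models and numbers are
# made up purely to illustrate the argument above.
import random

def independent(p=0.01):
    """Factory: each receiver drops every packet independently with prob p."""
    def make():
        return lambda: random.random() < p
    return make

def bursty(p_enter=0.002, p_stay=0.9):
    """Factory: Gilbert-style two-state loss -- mostly clean, but once a
    receiver enters the 'bad' state it drops packets back-to-back."""
    def make():
        state = {"bad": False}
        def lost():
            if state["bad"]:
                if random.random() > p_stay:
                    state["bad"] = False
            elif random.random() < p_enter:
                state["bad"] = True
            return state["bad"]
        return lost
    return make

def simulate(n_receivers, n_packets, factory):
    """Each round, multicast every packet that at least one receiver still
    reports missing.  Returns (rounds, total packets put on the wire)."""
    missing = [set(range(n_packets)) for _ in range(n_receivers)]
    loss = [factory() for _ in range(n_receivers)]
    rounds = sends = 0
    while any(missing):
        rounds += 1
        to_send = sorted(set().union(*missing))
        sends += len(to_send)
        for pkt in to_send:
            for r in range(n_receivers):
                if not loss[r]():        # this receiver actually heard it
                    missing[r].discard(pkt)
    return rounds, sends

if __name__ == "__main__":
    random.seed(1)
    n_packets = 1000                     # blocks in the image being pushed
    for n_receivers in (10, 100, 1000):
        for name, factory in (("independent 1% loss", independent()),
                              ("bursty loss", bursty())):
            rounds, sends = simulate(n_receivers, n_packets, factory)
            print("%4d receivers, %-20s %d rounds, %.2fx the image size"
                  % (n_receivers, name + ":", rounds, sends / n_packets))

Even in this best-case toy the resend traffic grows with the receiver count, because a packet only has to be missed by one node somewhere to be multicast again, and the bursty model changes both the total and the variance. A real protocol has to cope with all of that plus the failure modes above, which is why it starts to look like designing error-correcting codes.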
Why do some people think that multicast is a good idea? It might just be a good example of lessons learned not being unlearned. Using multicast was an excellent solution in the days of Ethernet coax and repeaters. When using a repeater, all machines see all traffic anyway, and older NICs treated all packets individually. Today clusters use switched Ethernet or specialized interconnects (Infiniband, Myrinet, etc.), all of which must handle multicast packets as a special case and emulate the behavior.

-- 
Donald Becker                          [EMAIL PROTECTED]
Penguin Computing / Scyld Software
www.penguincomputing.com               www.scyld.com
Annapolis MD and San Francisco CA