On Sat, 9 Dec 2006, Joe Landman wrote:
> Guy Coates wrote:
> > At what node count does the nfs-root model start to break down? Does
> > anyone have any rough numbers with the number of clients you can
> > support with a generic linux NFS server vs a dedicated NAS filer?
>
> If you use warewulf or the new perceus variant, it creates a ram disk
> which is populated upon boot. Thats one of the larger transients. Then
> you nfs mount applications, and home directories. I haven't looked at
> Scyld for a while, but I seem to remember them doing something like this.
I forgot to finish my reply to this message earlier this week. Since I'm
in the writing mood today, I've finished it.

Just when we were getting past "diskless" being misinterpreted as "NFS
root"... Scyld does use "ramdisks" in our systems, but calling the
system "ramdisk based" misses its point.

Booting: RAMdisks are critical

Ramdisks are a key element of the boot system. Clusters need reliable,
stateless node booting. We don't want local misconfiguration, failed
storage hardware or corrupted file systems to prevent booting. The boot
ramdisk has to be small, simple, and reliable. Large ramdisks multiply
PXE problems and have obvious server scalability issues. Complexity is
bad because the actions happen "blind", with no easy way to see which
step went wrong. We try to keep these images stable, with only ID table
and driver updates.

Run-time: RAMdisks are just an implementation detail

The run-time system uses ramdisks almost incidentally. The real point of
our system is creating a single point of administration and control -- a
single virtual system. To that end we have a dynamic caching, consistent
execution model. The real root "hypervisor" operates out of a ramdisk so
that it is independent of the hardware and storage that might be used by
application environments. The application root and caching system
default to using ramdisks, but they can be configured to use local or
network storage.

The "real root" ramdisk is pretty small and simple. It's never seen by
the applications, and only needs to keep its own housekeeping info. The
largest ramdisk in the system is the "libcache" file system. This cache
starts out empty. As part of accepting new applications, the execution
system (BProc or BeoProc) verifies that the correct versions of the
executable and libraries are available locally. By the time the node
says "yah, I'll accept that job", it has cached the exact versions it
needs to run. (*)

So really we are not using a "ramdisk install".
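The verify-before-accept step can be sketched roughly as follows. This
is a minimal illustration, not Scyld's actual code: the paths, the
`fetch_if_stale` helper, and fetching by copying from a local "master"
directory (rather than over the network from the master node) are all
assumptions made for the example.

```shell
#!/bin/sh
# Sketch of whole-file library caching: before accepting a job, compare
# the locally cached copy of each needed file against the master's
# version, and re-fetch on any mismatch.
CACHE=/tmp/libcache          # stand-in for the libcache file system
MASTER_DIR=/tmp/master       # stand-in for files served by the master
mkdir -p "$CACHE" "$MASTER_DIR"

# Pretend "master copy" of a library.
printf 'libm contents v2\n' > "$MASTER_DIR/libm.so"

fetch_if_stale() {
    f=$1
    want=$(md5sum "$MASTER_DIR/$f" | cut -d' ' -f1)   # master's version
    have=$(md5sum "$CACHE/$f" 2>/dev/null | cut -d' ' -f1)
    if [ "$want" != "$have" ]; then
        cp "$MASTER_DIR/$f" "$CACHE/$f"    # cache the exact version
        echo "fetched $f"
    else
        echo "cached copy of $f is current"
    fi
}

fetch_if_stale libm.so    # first call: cache is empty, so we fetch
fetch_if_stale libm.so    # second call: cached copy is already current
```

Once every file a job needs has passed this check, the node can keep
running that job even if the master later disappears or replaces a file.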
We are dynamically detecting hardware, and loading the right kernel and
device drivers under control of the boot system. Then we are creating a
minimal custom "distribution" on the compute nodes. The effect is the
same as creating a minimal custom "distribution" for that specific
machine -- an installation that has only the kernel, device drivers and
applications to be run on that node.

This approach to dynamically building an installation is feasible and
efficient because of another innovation: a sharp distinction between
full, standard "master" nodes and lightweight compute "slave" nodes.
Only master nodes run the full, several-minute initialization to start
standard services and daemons. ("How many copies of crond do you need?")
Compute slaves run only the end applications, and have a master with its
full reference install to fall back on when they need to extend their
limited environment.

(*) Whole-file caching is one element of the reliability model. It means
we can continue to run even if the master stops responding, or replaces
a file with a newer version. We provide a way for sophisticated sites to
replace the file cache with a network file system, but then the file
server must stay up for jobs to continue running, and you can run into
versioning/consistency issues.

RAMdisk Inventory

We actually have five (!) different types of ramdisks over the system
(see the descriptions below). But it's the opposite of the Warewulf
approach. Our architecture is a consistent system model, so we
dynamically build and update the environment on nodes. Warewulf-like
ramdisk systems only catch part of what we are doing.

The Warewulf approach:

- Uses a manually selected subset distribution on the compute node
  ramdisk. While still very large, it's never quite complete. No matter
  how useless you think some utility is, there is probably some
  application out there that depends on it.
- The ramdisk image is very large, and it has to be completely
  downloaded at boot time, just when the server is extremely busy.

- Supplements the ramdisk with NFS, combining the problems of both. (*)
  The administrator and users have to learn and think about how both
  fail.

(*) That said, combining a ramdisk root with NFS is still far more
scalable and somewhat more robust than using solely NFS. With careful
administration most of the executables will be on the ramdisk, allowing
the server to support more nodes and reducing the likelihood of
failures. The phrase "careful administration" should be read as "great
for demos, and when the system is first configured, but degrades over
time". The type of people that leap to configure the ramdisk properly
the first time are generally not the same type that will be there for
long-term manual tuning. Either they figure out why we designed around
dynamic, consistent caching and rework their setup, or the system will
degrade over time.

Ramdisk types

For completeness, here are the five ramdisk types in Scyld:

BeoBoot stage 1 (the "Booster Stage"):
Used only for non-PXE booting. Now obsolete, this allowed network
booting on machines that didn't have it built in. The kernel+ramdisk was
small enough to fit on floppy, CD-ROM, hard disk, Disk-on-Chip, USB,
etc. This ramdisk image contains NIC detection code and tables, along
with every NIC driver and a method to substitute kernels. The image must
be under 1.44MB, yet include all NIC drivers.

BeoBoot stage 2 ramdisk:
The run-time environment set-up, usually downloaded by PXE. Pretty much
the same NIC detection code as the stage 1 ramdisk, except potentially
optimized for only the NICs known to be installed. The purpose of this
ramdisk is to start network logging ASAP and then contact the master to
download the "real" run-time environment. When we have the new
environment we pivot_root and delete this whole ramdisk.
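That handoff can be sketched as a fragment of the ramdisk's init script.
This is an illustrative, non-runnable outline (it would execute as PID 1
inside the boot ramdisk), and the paths and init program name are
assumptions, not Scyld's actual code:

```shell
# Illustrative outline of the stage 2 handoff; paths and the init
# program name are assumptions, and error handling is omitted.
mkdir /newroot
mount -t tmpfs tmpfs /newroot
# ... start network logging, contact the master, and unpack the
#     downloaded run-time environment into /newroot ...
cd /newroot
pivot_root . oldroot    # /newroot becomes /, old root moves to /oldroot
# Hand off to the new environment; the new root then unmounts and
# deletes /oldroot.  Merely emptying the old ramdisk would leave its
# pages pinned, which is why the whole ramdisk is torn down.
exec chroot . /sbin/init
```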
We've already used the contents we cared about (tables & NIC drivers),
and merely emptying a ramdisk, rather than deleting it, frequently leaks
memory! It's critical that this ramdisk be small to minimize TFTP
traffic.

Stage 3, run-time environment supervisor:
(You can call this the "hypervisor".) This is the "real" root during
operation, although applications never see it. The size isn't critical
because we have full TCP from stage 2 to transfer it, but it shouldn't
be huge because

- it will compete with other, less robust booting traffic
- the master will usually be busy
- large images will delay node initialization

LibCache ramdisk:
This is a special-purpose file system used only for caching executables
and libraries. We designed the system with a separate caching FS so that
it could optionally switch to caching on a local hard disk partition.
That was useful with 32MB memory machines or when doing a rapid
large-boot demo, but the added complexity is rarely useful on modern
systems.

Environment root:
This is the file system the application sees. There is a different
environment for each master the node supports, or potentially even one
for each application started. By default this is a ramdisk configured as
a minimal Unix root by the master. The local administrator can change
this to be a local or network file system to get a traditional "full
install" environment, although that discards some of the robustness
advances in Scyld.

> Scyld requires a meatier head node as I remember due to its launch
> model.

Not really because of the launch model, or the run-time control. It's to
make the system less complex and simpler to use. Ideally the master does
less work than the compute nodes, because they are doing the
computations. In real life people use the master for editing, compiling,
scheduling, etc. It's the obvious place to put home directories and
serve them to compute nodes. And it's where the real-life cruft ends up,
such as license servers and reporting tools.
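Populating a minimal Unix root like the environment root described above
can be sketched as follows. This is an assumption-laden illustration: in
the real system the master builds the environment, while here we use a
scratch directory (no tmpfs mount), a single shell binary, and `ldd` to
find its libraries.

```shell
#!/bin/sh
# Sketch of building a minimal root: a skeleton directory tree plus one
# binary and the shared libraries it needs.  Paths are illustrative.
ROOT=/tmp/envroot
mkdir -p "$ROOT/bin" "$ROOT/etc" "$ROOT/tmp" "$ROOT/proc"

# Install a shell, then copy in the shared libraries it needs, found
# via ldd (lines with no path, e.g. the vdso, are simply skipped).
cp /bin/sh "$ROOT/bin/sh"
for lib in $(ldd /bin/sh 2>/dev/null | grep -o '/[^ ]*'); do
    mkdir -p "$ROOT$(dirname "$lib")"
    cp "$lib" "$ROOT$lib"
done
```

The same loop, run over the applications assigned to a node, is the
sense in which each node gets an installation containing only what it
will actually run.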
Internally each type of service has its own server IP address and port.
We could point them at replicated masters or other file servers. They
just all point to the single master to keep things simple. For
reliability we can have cold, warm or hot spare masters. But again, it's
less complex to administer one machine with redundant power supplies and
hot-swap RAID5 arrays. All this makes the master node look like the big
guy.

-- 
Donald Becker                          [EMAIL PROTECTED]
Scyld Software                         Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220          www.scyld.com
Annapolis MD 21403                     410-990-9993

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf