On Sat, 5 Apr 2008, Anand Vaidya wrote:
> On Fri, Apr 4, 2008 at 6:19 PM, Geoff Galitz <[EMAIL PROTECTED]> wrote:
>
> Having said that, I think that the Linux clustering scene needs a little
> competition, especially the for-fee ones.  Apart from SDSC, not many
> innovations are happening.
> I am not referring to standalone projects, where the FOSS community has a
> lot of innovation happening, but rather one integrated Linux Cluster on a
> DVD that gets you a cluster ready in 20 minutes, with no pain at all.
> ROCKS comes with its own problems, esp. wrt updates (which is why I
> stopped using ROCKS), however they are working on this one, AFAIK.

I delayed responding to this, since I expected that someone else would talk
about it.  (Joe did, but only a little.)

Scyld published the first "cluster on a disk" in late 2000.  It was a single
install disk that asked two or three extra questions over a standard Linux
install, installed in about the same time as the underlying distribution
(essentially Red Hat, so about 20 minutes), and could boot about 500 slave
nodes in about a minute over Fast Ethernet. (1)

Two years later our demo version was a live CD, so it was zero install on
the master as well.  A single live CD could boot a 1000-node cluster and run
one of a few toy apps.

A sad thing for me is that we can no longer publish a similar CD.  We were
heavily marketed against for being an integrated system, even to the point
of implying Scyld wasn't really Linux.  In the end we had to change how we
deliver our system to make it clear.  Today we have a two-step install,
starting with a generic Linux distribution (typically CentOS or RHEL) and
later adding our packages.  With this packaging we can no longer have a live
CD that acts the same as the installed version.

(1) A drawback of machines in that era was that they didn't have network
booting built in.  We had to invent our own network booting system, BeoBoot,
and have it support every possible network adapter.  Operationally it was a
PITA since it required every node to first boot off of a floppy, CD, flash
or tiny hard disk partition.  So you first had to write/burn a bunch of
floppy/CD-R disks, and reading them delayed booting so that it was more like
2 minutes to boot 100 slave nodes.

We put a bunch of effort into making this boot process admin-free.  The
BeoBoot system uses a stable kernel to download the operational version from
the master.  This both makes updating the kernel a single-point effort and
eliminates the risk of making the whole cluster unbootable with a flawed
update.

> So, here's what the FOSS community, especially vendors (RH, Novell), should
> be doing, specifically for a HPC oriented version:
>
> - remove all unwanted packages (desktop software, multimedia, web browsers
> etc)

We have a better way, driven by long experience.  Don't go through the
error-prone process of figuring out a minimal system.  (Modern RPM systems
will pull in almost everything anyway, yet still omit a critical tool.)
Instead do a full, standard install and configuration on the master and use
it as your reference.

For the compute nodes start from zero, and build from there.  First, use the
network boot system to figure out what kernel they should run, and have the
master pass them that kernel plus the network driver.  Then the master asks
each node what hardware it has, and uses its local configuration to figure
out what kernel modules plus configuration info are needed to support that
hardware.  (A rough sketch of that lookup is below.)

Then whenever you start a job on a compute node, verify that it has the
currently correct version of the executable and libraries.  If it doesn't
(and "I got nothin'" is the same as having the wrong version), copy it over.
Don't page it in, which results in unpredictable performance.  Just do a
single transfer and cache the whole executable/library to linear memory.
It's the application you are about to run, and with a zero install the node
is only running compute applications, so you won't be wasting memory.
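As a rough sketch only (this is not our shipping code; fetch_from_master()
and the cache path are stand-in assumptions, and the cache directory is
assumed to be a tmpfs so the whole file stays in memory), that per-job check
might look like:

    # Rough sketch (assumed helpers and paths): before launching a job,
    # compare the node's cached copy of each executable/library against the
    # master's checksum and pull a fresh, complete copy only when they differ.
    import hashlib
    import os
    import shutil

    CACHE_DIR = "/var/cache/jobfiles"   # assumed tmpfs mount on the node

    def local_checksum(path):
        """SHA-1 of the cached file, or None if we have nothing at all."""
        if not os.path.exists(path):
            return None                 # "I got nothin'" == wrong version
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def stage_file(name, master_sum, fetch_from_master):
        """Make the cached copy match the master, copying it whole if needed."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        cached = os.path.join(CACHE_DIR, name)
        if local_checksum(cached) != master_sum:
            tmp = cached + ".tmp"
            with open(tmp, "wb") as out:
                # One linear transfer of the whole file -- no demand paging
                # over the network later, so run time stays predictable.
                shutil.copyfileobj(fetch_from_master(name), out)
            os.replace(tmp, cached)     # atomic swap into the cache
        return cached

The point of the single linear transfer is that all of the I/O cost is paid
up front, before the job starts doing timing-sensitive work.

And the hardware-to-modules lookup mentioned above can be sketched against
the stock modules.alias database that every kernel install carries (again a
rough illustration, not our actual tooling; the node-report format is an
assumption):

    # Rough sketch: given the modalias strings a compute node reports for its
    # hardware, look up the matching driver modules in the reference kernel's
    # modules.alias database on the master.
    import fnmatch
    import os

    def modules_for_node(node_aliases, kernel_version=os.uname().release):
        alias_db = "/lib/modules/%s/modules.alias" % kernel_version
        patterns = []                   # (wildcard pattern, module name)
        with open(alias_db) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 3 and parts[0] == "alias":
                    patterns.append((parts[1], parts[2]))
        needed = set()
        for alias in node_aliases:
            for pattern, module in patterns:
                if fnmatch.fnmatchcase(alias, pattern):
                    needed.add(module)
        return sorted(needed)

Either way, the master's full reference install is the single source of
truth; the nodes carry nothing that has to be kept in sync by hand.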
> - package SGE, Ganglia

Pretty much a given... you need at least a mapper-scheduler and a monitoring
system.  You can do slightly better than these, but it's easier to do much
worse.

> - a good clustering toolkit, maybe derived from ROCKS scripts (I am biased
> towards IBM xcat, because that is the only tool I use)

Why point to ROCKS as an example?  Like so many other "cluster systems" it's
a non-architecture, an ad hoc system.  It's a packaging and support exercise,
not innovation.  It's a giant step back to the Windows world, where simple
administration, such as adding new nodes, is done by re-installation.

> - LDAP as the default auth source, setup SSH for clusterwide passwordless
> logins by default

Both are high-cost, sub-optimal choices for normal operation.

We implement a cluster-specific name service that handles most name queries
very quickly, and pointedly without network transactions.  We fall back to
other services only when the application asks about external things, e.g.
other users or non-cluster hosts.  (We recently added our own
network-fallback service so that the master can resolve these without
configuring NIS/LDAP/AD on compute nodes.)

Standard 'ssh' is slow to start jobs, and not precise about the environment
and executables.  You solve the first problem by building persistent network
connections between the head and compute nodes, authenticating only once.

> - package a selection of top20 FLOSS science apps (Gromacs, Phylip, Blast,
> MPICH, fasta, fftw etc)

Libraries are mostly easy.  We automate what we can, but we have learned
that the interesting apps require human configuration or tuning.

> - package and provide one click installation for restricted-ware such as
> NAMD, or commercial software such as Intel Compilers, Fluent, Amber etc.
> It CAN be done, Ubuntu has demonstrated how to do it well.

We've done this by providing demo-license versions where possible, such as
with the Intel compilers.  But most HPC ISVs don't have the resources to be
flexible.  I don't see any HPC distribution+app installation being as easy
as Ubuntu for at least a few years, even if we jump up and down and point,
screaming "It's easy.  They do it.  They even show how to do it."

> - package and provide easy install of a parallel filesystem such as GFS or
> Lustre

We shipped integrated PVFS starting with our second release, including
funding the PVFS guys to make it easy to configure.  Over the years we have
included a few others, but preconfigured support for advanced distributed
and cluster file systems hasn't justified the effort and cost.  We now
sometimes include the kernel modules, but configuration is done as a
professional service or by customers who are already experts.

-- 
Donald Becker                           [EMAIL PROTECTED]
Penguin Computing / Scyld Software
www.penguincomputing.com                www.scyld.com
Annapolis MD and San Francisco CA

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf