Re: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk

Eric Thibodeau Tue, 30 Sep 2008 17:49:10 -0700

Jon

I'm replying to Don's post since he outlines most of the reasons why I choose to use the NFS-mounted approach and let you choose whether or not you want local disk(s) for scratch. Which brings up the _real_ questions:


- how many nodes
- are they all identical
- how many users concurrently using the cluster?

- do you have assigned full-time staff responsible for the cluster (as in, hired in-house staff that will be there to maintain the cluster)?

As an example, I'm a student managing a cluster at our department and converted it from disk-based RH 7.3 to NFS-booted Gentoo nodes. This has given me much flexibility and a very fast path to upgrade the nodes (LIVE!), since they only need to be rebooted if I change the kernel. I can install/upgrade the node's environment by simply chrooting into it and using the node's package manager and utilities as if it were a regular system. But I am in a special case where, if I break the cluster, I can fix it quickly, and I always have a backup copy of the boot "root" image ready to switch to if my fiddling goes wrong. This also implies users aren't on the cluster when I arbitrarily decide to change the compiler from GCC-4.1.1 to GCC-4.3.2 ;) Hence the few points I mention above and the weighted importance of each of them.

This said, one thing I haven't seen (explicitly) mentioned in all the replies is that you don't need a 1 to 1 correlation of OS/RAM; this is where you use Unionfs (or aufs) + an NFS-mounted root. I am currently writing up a document on how I accomplish this (Gentoo Clustering LiveCD); I'll give you a link to the beta version of the document if you want. The section describing the SSI (Single System Image) gives more details of what is discussed here.
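
To make the union idea a bit more concrete, here is a toy Python model of the lookup rule only. It is not how Unionfs or aufs is implemented, and it is not the setup from the document mentioned above. The two temporary directories stand in for a shared read-only root (what would be the NFS export) and a small node-private writable layer (what would live in RAM); the function names `union_read`, `union_write` and `demo` are made up for illustration.

```python
import os
import tempfile


def union_read(rel_path, upper, lower):
    """Return file contents, preferring the node-private upper layer."""
    for layer in (upper, lower):
        candidate = os.path.join(layer, rel_path)
        if os.path.isfile(candidate):
            with open(candidate) as fh:
                return fh.read()
    raise FileNotFoundError(rel_path)


def union_write(rel_path, data, layer):
    """Write a file into the given layer, creating parent directories."""
    target = os.path.join(layer, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as fh:
        fh.write(data)


def demo():
    # 'lower' plays the shared root image, 'upper' the per-node RAM layer.
    with tempfile.TemporaryDirectory() as lower, \
         tempfile.TemporaryDirectory() as upper:
        # Seed the shared image once (done on the server, not by a node).
        union_write("etc/hostname", "generic-node\n", lower)

        before = union_read("etc/hostname", upper, lower)

        # A node-local change lands in the upper layer only.
        union_write("etc/hostname", "node07\n", upper)
        after = union_read("etc/hostname", upper, lower)

        # The shared image is untouched, so other nodes still see it.
        shared = union_read("etc/hostname", lower, lower)
        return before, after, shared


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the node's RAM only has to hold what the node changes, while everything else is read from the shared root on demand.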


Eric

Donald Becker wrote:

On Sun, 28 Sep 2008, Jon Forrest wrote:
There are two philosophies on where a compute node's
OS and basic utilities should be located:
1) On a local harddrive
2) On a RAM disk
I'd like to start a discussion on the positives
and negatives of each approach. I'll throw out
a few.

Both approaches require that a compute node "distribution"
be maintained on the frontend machine. In both cases
it's important to remember to make any changes to this
distribution rather than just using "pdsh" or "tentakel"
to dynamically modify a compute node. This is so that the
next time the compute node boots, it gets the up-to-date
distribution.

Ahhh, your first flawed assumption.

You believe that the OS needs to be statically provisioned to the nodes.
That is incorrect.

A compute node only needs what it will actually be running
  - a kernel and device drivers that match the hardware
  - kernel support for non-hardware-specific features (e.g. ext3 FS)
  - a file system that presents a standard application environment
    (the configuration files that the libraries depend upon, e.g. a few
     files in /etc/*, a /dev/* that matches the hardware, a few misc.
     directories)
  - the application executable and libraries it links against
  - application-specific file I/O environment (usually /tmp/ and a
    few data directories)
You can detect the first and most of the second category at node boot time. The kernel is loaded into memory and kernel modules are immediately linked in, so there isn't any reason to keep them around as a file system.

The third category does need to be a file system, but it's tiny and changes infrequently. It can easily be provisioned, or even dynamically created, at node boot.

The fourth category is an interesting one. You don't have to statically provision it at boot time, or mount a network file system. When you issue a process to a node, the system that accepts the process can check that it has the needed executable and libraries. Better, it can verify that it has the correct versions. And this is the best time to check, because we can ask the sending machine for a current copy if we don't have the correct version. By having a model for "execution correctness" we simultaneously eliminate one source of version skew and eliminate the need to pre-load executables and libraries that will be unused or updated before use. Plus we automatically have a way to handle newly added applications, libraries and utilities without rebooting compute nodes.
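
A minimal sketch of that check-at-process-start idea, under stated assumptions: the sender ships a manifest of content hashes along with the process, and the node pulls anything missing or stale before it runs. The names `digest`, `stale_or_missing`, `accept_process` and the `fetch` callback are hypothetical, and comparing SHA-256 hashes is just one plausible way to decide "correct version"; the text above does not say how the real system does it.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used here as the 'version' of an executable or library."""
    return hashlib.sha256(data).hexdigest()


def stale_or_missing(manifest, cache):
    """List names the node lacks or holds in the wrong version.

    manifest: name -> expected digest, sent with the process (assumption).
    cache:    name -> bytes currently held on the compute node.
    """
    return sorted(
        name
        for name, wanted in manifest.items()
        if name not in cache or digest(cache[name]) != wanted
    )


def accept_process(manifest, cache, fetch):
    """Bring the node's cache up to date before the process starts.

    fetch(name) is a hypothetical callback that asks the sending machine
    for a current copy of one file.
    """
    for name in stale_or_missing(manifest, cache):
        data = fetch(name)
        if digest(data) != manifest[name]:
            raise ValueError(f"sender returned wrong version of {name}")
        cache[name] = data
    return cache


if __name__ == "__main__":
    master = {"app": b"v2", "libm.so": b"m1", "libc.so": b"c1"}
    manifest = {name: digest(data) for name, data in master.items()}
    node_cache = {"app": b"v1", "libc.so": b"c1"}  # one stale, one missing

    print(stale_or_missing(manifest, node_cache))
    accept_process(manifest, node_cache, master.__getitem__)
    print(stale_or_missing(manifest, node_cache))
```

Only the files a job actually needs are ever transferred, and an unchanged file is never transferred twice, which is the source of both the memory saving and the protection against version skew.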
Assuming the actual OS image is the same in both cases,
#2 clearly requires more memory than #1.
No, it can require substantially less.  It only requires more if you
assume the naive approach of building a giant RAMdisk with everything you
might need.  If you think of an alternative model where you are just
caching the elements needed to do a job, the memory usage is less.
Think of a compute node as part of a cluster, not a stand-alone machine. The only times that it is asked to do something new (boot, accept a new process) it's communicating with a fully installed, up-to-date master node. It has, at least temporarily, complete access to a reference install. It can take that opportunity to cache or load elements that it doesn't have, or has an obsolete version of.

There might be some dynamic elements needed later, e.g. name service look-ups, but these should be much smaller than the initial provisioning and the correctness/consistency model is inherently looser.
Long ago not installing a local harddrive saved a considerable
amount of money but this isn't true anymore. Systems that need
to page (or swap) will require a harddrive anyway since paging
over the network isn't fast enough so very few compute nodes
will be running diskless.
The hardware cost of a local hard drive wasn't really an issue. It has always been the least expensive I/O bandwidth available. The real cost is installing, updating and backing up the drive. If you design a cluster system that installs on a local disk, it's very difficult to adapt it to diskless blades. If you design a system that is as efficient without disks, it's trivial to optionally mount disks for caching, temporary files or application I/O.
Approach #2 requires much less time when a node is installed,
and a little less time when a node is booted.
We've been able to start diskless compute nodes in
  <BIOS memory count> + <PXE 2 seconds> + 750 milliseconds  (!)

To be fair, that was on blades without disk controllers, and just
Ethernet.  Scanning for local disks, especially with a SCSI layer, can
take many seconds.  Once you detect a disk it takes a bunch of slow seeks
to read the partition table and mount a modern file system (not EXT2).
So trimming the system initialization time further isn't a priority until
after the file system and IB init times are shortened.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
