Robert G. Brown wrote:
> Remember I'm just such a type as well. So are a whole lot of primary
> contributors on this list. Building and USING a cluster to perform
> actual work provides one with all sorts of real-world experience that
> goes into building your next one, or helping others to do so. Many
> people -- e.g. Greg Lindahl or Joe L. or Jim L. -- seem to move among
> these worlds and more: they use clusters to do research, engineer and
> manage clusters, and do corporate stuff with or for clusters.
As would I. But I would prefer that they install their versions and
libraries in their own directories rather than in the system-wide ones.
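
For what it's worth, a minimal sketch of that pattern in Python (the
prefix, script name, and command below are made up for illustration,
not anything from our actual setup):

    #!/usr/bin/env python
    """Run a command with a group's private software tree ahead of
    the system-wide directories.  The prefix is a made-up example."""
    import os
    import sys

    GROUP_PREFIX = "/home/groupA/sw"   # hypothetical per-group prefix

    def prepend(var, path):
        # Put `path` at the front of a colon-separated env variable.
        old = os.environ.get(var, "")
        os.environ[var] = path + (":" + old if old else "")

    if len(sys.argv) < 2:
        sys.exit("usage: groupenv.py command [args...]")

    prepend("PATH", os.path.join(GROUP_PREFIX, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(GROUP_PREFIX, "lib"))

    # Hand off to the requested command with the group tree in front.
    os.execvp(sys.argv[1], sys.argv[1:])

Nothing outside the group's own tree gets touched, so nobody else's
jobs can break.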
> Rather, what I think he's saying is that in a large cluster
> environment where there are many and diverse user groups sharing an
> extended resource, careless management can cost productivity -- which
> is absolutely true. Examples of careless management certainly include
> thoughtlessly updating some mission-critical library to solve a
> problem for group A at the expense of breaking applications for
> groups B and C, but this can be done just as easily by a professional
> administrator as by a research group. The only difference is that a
> "cluster administrator" is usually professionally charged with not
> being so careless, with having a view of the whole, and with having
> the time to properly test things and so on. A good cluster
> administrator takes this responsibility seriously and may well seek
> to remain in firm control of updates and so on in order to accomplish
> this.
I exercise extreme caution in running research machines. For example,
my older clusters have both g98 and g03 installed, to avoid possible
differences between versions in the middle of long-running projects.
Researchers have the choice of which version they will use. In general
I do the same thing with GAMESS, BLAST, etc.: I will install a new
version alongside the old rather than immediately upgrade an existing
version if there is no specific need to do so.
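
As a sketch of the mechanics, with hypothetical install roots (a real
Gaussian setup also wants its own environment variables set up), a
thin Python wrapper can let each researcher pin a version per project:

    #!/usr/bin/env python
    """Run whichever installed version the researcher selects,
    rather than upgrading in place.  Install roots are made up."""
    import os
    import sys

    VERSIONS = {              # side-by-side installs, never overwritten
        "g98": "/usr/local/gaussian/g98",
        "g03": "/usr/local/gaussian/g03",
    }

    choice = os.environ.get("GAUSSIAN_VERSION", "g03")  # user's choice
    if choice not in VERSIONS:
        sys.exit("unknown version %r, choose one of %s"
                 % (choice, ", ".join(sorted(VERSIONS))))

    # Put the chosen version's binaries first, then run it as given.
    os.environ["PATH"] = (os.path.join(VERSIONS[choice], "bin") + ":"
                          + os.environ.get("PATH", ""))
    os.execvp(choice, [choice] + sys.argv[1:])

A researcher can then set GAUSSIAN_VERSION=g98 for the life of a
project and get identical binaries every time.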
> As you observe, ultimately this comes down to good communications and
> core competence among ALL people with root-level access for ANY LAN
> operation (not just cluster computing -- you can do the exact same
> thing in any old LAN). There are many ways to enforce this:
>
>   - fascist topdown management by a competent central IT group that
>     permits "no" direct user management of the cluster;
>
>   - completely permissive management, where each group talks over any
>     changes likely to affect others but retains privileges to access
>     and root-manage at least the machines that they "own" in a
>     collective cluster (yes, this can work and work well, and is in
>     fact workING in certain environments right now);
>
>   - something like COD, whereby any selected subcluster can be booted
>     in realtime into a user's own individually developed "cluster
>     node image" via e.g. DHCP, so that while you're using the nodes
>     you TOTALLY own them but cannot screw up access to those same
>     nodes when OTHER people boot them into THEIR own image;
>
>   - and lots more besides, including topdown
>     not-quite-so-conservative management (which is probably the
>     norm).
Many of us struggle with this. There are certainly good reasons for
individual images. If people like Joe, Greg, and Jim are making those
images, I feel pretty good about it.
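
To make the COD idea above concrete, here is a minimal sketch in
Python of the bookkeeping behind pointing nodes at one user's image
over PXE/DHCP. The paths, MAC addresses, and image name are invented,
and a real setup involves much more (DHCP scoping, scheduling,
cleanup), but per-node boot assignment is the core of it:

    #!/usr/bin/env python
    """Hand a set of nodes to one user's boot image by writing
    per-host pxelinux config files.  Names here are hypothetical."""
    import os

    TFTP_CFG = "/tftpboot/pxelinux.cfg"   # hypothetical pxelinux dir

    TEMPLATE = ("default userimage\n"
                "label userimage\n"
                "    kernel images/%(image)s/vmlinuz\n"
                "    append initrd=images/%(image)s/initrd.img"
                " root=/dev/nfs ro\n")

    def assign(macs, image):
        # pxelinux looks up a per-host file named
        # 01-<MAC, colons replaced by dashes>.
        for mac in macs:
            fname = "01-" + mac.lower().replace(":", "-")
            with open(os.path.join(TFTP_CFG, fname), "w") as f:
                f.write(TEMPLATE % {"image": image})

    # e.g. give four nodes to group A's image until their run is done:
    assign(["00:0e:0c:aa:bb:%02x" % i for i in range(4)],
           "groupA-node-image")

When the nodes are released, you rewrite the same files to point back
at the default image and reboot; the user never needed root on
anything shared.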
One way that we resolve this issue is with a technology refresh that
attempts to upgrade the computational infrastructure on a 3-year
cycle. For example, 2 years ago we purchased a 64p Xeon cluster. Last
year we cooperated with Physics researchers to purchase a 200p Opteron
cluster. This year we are bringing up a 100p addition to the Opterons.
Next year we will add ~300 processors and begin retiring the original
64p Xeon cluster, which still takes us from 64 to roughly 600
processors (200 + 100 + ~300) over that three-year period.
> At a guess, Really Big Clusters -- ones big enough to have a
> full-time administrator or even an administrative group -- are going
> to strongly favor topdown fascist administration, as there are clear
> lines of responsibility and a high "cost" of downtime. For these to
> be successful there have to be equally firm, open lines of
> communication, so that researchers' work is (safely and competently)
> enabled regardless of the administration skills of members of any
> group. Larger shared corporate clusters are also very likely to fall
> into this category, although I'm sure there are many exceptions at
> the workgroup level. Small research-group-owned clusters are as
> likely as not to be locally owned and operated even today. In between
> you're bound to see almost anything.
Communication is definitely paramount. I meet with researchers, PIs,
and postdocs almost daily. Part of that is simple outreach: showing
people the new capabilities that we have and the performance
improvements that they can expect. Part of it is building the trust
necessary for the kind of cooperation that helps the Institution.
The "high cost" of downtime is very real. Unexpected downtime will
bring calls to the Provost. I work to minimize that unexpected
downtime, and it's not easy.
Last week an issue with an upgrade to the University backbone cost one
of my subnets its external connectivity for 16 hours. There was no
effect on running jobs, but there was no way for users to log in
between 6pm and 10am. That's not good. The network change notification
specified intermittent short outages as the upgrade progressed, not a
complete loss of communication for 16 hours. When the subnet had been
down for 3 hours, I began emailing users to let them know the
situation and what was being done to correct it. Two clusters reside
on that one subnet; other clusters, including our Opterons, were not
affected.
But there is no doubt that we must all communicate and cooperate to
make things work in both the big and small pictures.
Mike Davis