Robert G. Brown wrote:
> Remember I'm just such a type as well. So are a whole lot of primary
> contributors on this list. Building and USING a cluster to perform
> actual work provides one with all sorts of real-world experience that
> goes into building your next one, or helping others to do so. Many
> people -- e.g. Greg Lindahl or Joe L. or Jim L. -- seem to move among
> these worlds and more: they use clusters to do research, engineer and
> manage clusters, and do corporate stuff with or for clusters.
As would I. But I would prefer that they install their versions and
libraries in their own directories rather than in the system-wide ones.
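
For what it's worth, a minimal sketch of that pattern in Python (the
prefix, script name, and command below are made up for illustration,
not anything from our actual setup):

    #!/usr/bin/env python
    """Run a command with a group's private software tree ahead of
    the system-wide directories.  The prefix is a made-up example."""
    import os
    import sys

    GROUP_PREFIX = "/home/groupA/sw"   # hypothetical per-group prefix

    def prepend(var, path):
        # Put `path` at the front of a colon-separated env variable.
        old = os.environ.get(var, "")
        os.environ[var] = path + (":" + old if old else "")

    if len(sys.argv) < 2:
        sys.exit("usage: groupenv.py command [args...]")

    prepend("PATH", os.path.join(GROUP_PREFIX, "bin"))
    prepend("LD_LIBRARY_PATH", os.path.join(GROUP_PREFIX, "lib"))

    # Hand off to the requested command with the group tree in front.
    os.execvp(sys.argv[1], sys.argv[1:])

Nothing outside the group's own tree gets touched, so nobody else's
jobs can break.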
> Rather, what I think he's saying is that in a large cluster
> environment where there are many and diverse user groups sharing an
> extended resource, careless management can cost productivity -- which
> is absolutely true. Examples of careless management certainly include
> thoughtlessly updating some mission-critical library to solve a
> problem for group A at the expense of breaking applications for
> groups B and C, but this can be done just as easily by a professional
> administrator as by a research group. The only difference is that a
> "cluster administrator" is usually professionally charged with not
> being so careless, with having a view of the whole, and with having
> the time to properly test things and so on. A good cluster
> administrator takes this responsibility seriously and may well seek
> to remain in firm control of updates and so on in order to accomplish
> this.
I exercise extreme caution in running research machines. For example,
my older clusters have both g98 and g03 installed, to avoid possible
differences between versions in the middle of long-running projects.
Researchers have the choice of which version they will use. In general
I do the same thing with GAMESS, BLAST, etc.: I will install a new
version alongside the old rather than immediately upgrade an existing
version if there is no specific need to do so.
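
As a sketch of the mechanics, with hypothetical install roots (a real
Gaussian setup also wants its own environment variables set up), a
thin Python wrapper can let each researcher pin a version per project:

    #!/usr/bin/env python
    """Run whichever installed version the researcher selects,
    rather than upgrading in place.  Install roots are made up."""
    import os
    import sys

    VERSIONS = {              # side-by-side installs, never overwritten
        "g98": "/usr/local/gaussian/g98",
        "g03": "/usr/local/gaussian/g03",
    }

    choice = os.environ.get("GAUSSIAN_VERSION", "g03")  # user's choice
    if choice not in VERSIONS:
        sys.exit("unknown version %r, choose one of %s"
                 % (choice, ", ".join(sorted(VERSIONS))))

    # Put the chosen version's binaries first, then run it as given.
    os.environ["PATH"] = (os.path.join(VERSIONS[choice], "bin") + ":"
                          + os.environ.get("PATH", ""))
    os.execvp(choice, [choice] + sys.argv[1:])

A researcher can then set GAUSSIAN_VERSION=g98 for the life of a
project and get identical binaries every time.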
> As you observe, ultimately this comes down to good communications and
> core competence among ALL people with root-level access for ANY LAN
> operation (not just cluster computing -- you can do the exact same
> thing in any old LAN). There are many ways to enforce this:
>
>   - fascist topdown management by a competent central IT group that
>     permits "no" direct user management of the cluster;
>
>   - completely permissive management, where each group talks over any
>     changes likely to affect others but retains privileges to access
>     and root-manage at least the machines that they "own" in a
>     collective cluster (yes, this can work and work well, and is in
>     fact workING in certain environments right now);
>
>   - something like COD, whereby any selected subcluster can be booted
>     in realtime into a user's own individually developed "cluster
>     node image" via e.g. DHCP, so that while you're using the nodes
>     you TOTALLY own them but cannot screw up access to those same
>     nodes when OTHER people boot them into THEIR own image;
>
>   - and lots more besides, including topdown
>     not-quite-so-conservative management (which is probably the
>     norm).
Many of us struggle with this. There are certainly good reasons for
individual images. If people like Joe, Greg, and Jim are making those
images, I feel pretty good about it.
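
To make the COD idea above concrete, here is a minimal sketch in
Python of the bookkeeping behind pointing nodes at one user's image
over PXE/DHCP. The paths, MAC addresses, and image name are invented,
and a real setup involves much more (DHCP scoping, scheduling,
cleanup), but per-node boot assignment is the core of it:

    #!/usr/bin/env python
    """Hand a set of nodes to one user's boot image by writing
    per-host pxelinux config files.  Names here are hypothetical."""
    import os

    TFTP_CFG = "/tftpboot/pxelinux.cfg"   # hypothetical pxelinux dir

    TEMPLATE = ("default userimage\n"
                "label userimage\n"
                "    kernel images/%(image)s/vmlinuz\n"
                "    append initrd=images/%(image)s/initrd.img"
                " root=/dev/nfs ro\n")

    def assign(macs, image):
        # pxelinux looks up a per-host file named
        # 01-<MAC, colons replaced by dashes>.
        for mac in macs:
            fname = "01-" + mac.lower().replace(":", "-")
            with open(os.path.join(TFTP_CFG, fname), "w") as f:
                f.write(TEMPLATE % {"image": image})

    # e.g. give four nodes to group A's image until their run is done:
    assign(["00:0e:0c:aa:bb:%02x" % i for i in range(4)],
           "groupA-node-image")

When the nodes are released, you rewrite the same files to point back
at the default image and reboot; the user never needed root on
anything shared.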
One way that we resolve this issue is with a technology refresh that
attempts to upgrade the computational infrastructure on a 3-year
cycle. For example, 2 years ago we purchased a 64p Xeon cluster. Last
year we cooperated with Physics researchers to purchase a 200p Opteron
cluster. This year we are bringing up a 100p addition to the Opterons.
Next year we will add ~300 processors and begin retiring the original
64p Xeon cluster, which still takes us from 64 to roughly 600
processors (200 + 100 + ~300) over that three-year period.
> At a guess, Really Big Clusters -- ones big enough to have a
> full-time administrator or even an administrative group -- are going
> to strongly favor topdown fascist administration, as there are clear
> lines of responsibility and a high "cost" of downtime. For these to
> be successful there have to be equally firm, open lines of
> communication, so that researchers' work is (safely and competently)
> enabled regardless of the administration skills of members of any
> group. Larger shared corporate clusters are also very likely to fall
> into this category, although I'm sure there are many exceptions at
> the workgroup level. Small research-group-owned clusters are as
> likely as not to be locally owned and operated even today. In between
> you're bound to see almost anything.
Communication is definitely paramount. I meet with researchers, PIs,
and postdocs almost daily. Part of that is simple outreach: showing
people the new capabilities that we have and the performance
improvements that they can expect. Part of it is building the trust
necessary for the kind of cooperation that helps the Institution.
The "high cost" of downtime is very real. Unexpected downtime will
bring calls to the Provost. I work to minimize that unexpected
downtime, and it's not easy.
Last week an issue with an upgrade to the University backbone cost one
of my subnets its external connectivity for 16 hours. There was no
effect on running jobs, but there was no way for users to log in
between 6pm and 10am. That's not good. The network change notification
specified intermittent short outages as the upgrade progressed, not a
complete loss of communication for 16 hours. When the subnet had been
down for 3 hours, I began emailing users to let them know the
situation and what was being done to correct it. Two clusters reside
on that one subnet; other clusters, including our Opterons, were not
affected.
But there is no doubt that we must all communicate and cooperate to
make things work in both the big and small pictures.
Mike Davis