I think measuring a cluster's success by the number of jobs run
or CPUs used is a poor measure of true success. I would be more
inclined to consider a cluster a success after speaking with the people
who use it and finding out not only whether they can use it effectively,
but also what new science the cluster is enabling for them.
now try that with a large user-base ;)
I think there are two broad categories of cluster: dedicated and shared.
dedicated clusters are easy: limited number of codes, users, etc.
straightforward metrics are appropriate, such as pend time (perhaps
as a fraction of wallclock), job fail rates, fraction-of-peak measures.
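the pend-as-a-fraction-of-wallclock metric can be sketched from per-job
timestamps; the record fields here (`submit`, `start`, `end`, epoch seconds)
are made-up names for illustration, not any particular scheduler's output:

```python
def pend_ratio(job):
    """Pend time as a fraction of wallclock for one job record.
    Field names (submit/start/end, epoch seconds) are hypothetical."""
    pend = job["start"] - job["submit"]
    wallclock = job["end"] - job["start"]
    return pend / wallclock

# a job that pended 120 s, then ran for 3600 s
pend_ratio({"submit": 0, "start": 120, "end": 3720})  # -> ~0.033
```

low ratios across most jobs suggest the cluster keeps up with demand;
the caveat about contention shaping behavior (below) still applies.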
past this, things are harder and fuzzier. we try pretty hard to get
research outcomes from our users (lit citations, grants, grad student
and postdoc counts.) we try other metrics too: trying to find researchers
who get an account, generate minimal usage, then stop ("frustrated").
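that "frustrated user" query might look something like this; the thresholds
and field names are invented for illustration, not tuned recommendations:

```python
def looks_frustrated(user, max_jobs=20, idle_days=90):
    """Flag accounts with minimal usage that then went silent.
    The thresholds (20 jobs, 90 days) are arbitrary illustrations."""
    return (user["total_jobs"] < max_jobs
            and user["days_since_last_job"] > idle_days)

looks_frustrated({"total_jobs": 3, "days_since_last_job": 200})   # -> True
looks_frustrated({"total_jobs": 500, "days_since_last_job": 2})   # -> False
```

the flagged accounts are a starting list for follow-up conversations,
not a verdict - some of them just finished their project.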
for bigger, shared facilities, the simple metrics become less useful -
for instance, pend:wallclock is meaningful as long as cluster contention
doesn't "shape" user behavior. once users start reacting to contention
(by submitting fewer jobs, or maybe more), the metric's spoiled.
the only thing i find most of the below metrics really useful for is
figuring out whether or not we need a bigger cluster. though i guess
it's a little hard to imagine a case where metrics wouldn't call for
a larger cluster - does anyone really have persistently underutilized
clusters?
I also think you need to ask the "business" people what measure would
make them consider the cluster a worthwhile investment; it doesn't sound
as if you have that from your email.
my guess is that suits should be talked to about opportunity cost,
and not given a bunch of stats about utilization. that means you need
to get some info from users about what they're doing. but also
to figure out whether there's more they could do. and really, talking
to the users is important to do anyway.
Upper management is asking for us to define and provide
some sort of "numbers" which can be used to gauge the success of our
cluster project.
take a look at your cluster stats: do you have different groups with
bursty activity, but which interleaves on the cluster? that's obviously
better than multiple groups each having (probably smaller) clusters
with lower utilization over time...
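one quick check for that interleaving effect: compare the peak of the
combined load against the sum of per-group peaks (the core counts and
sample points below are invented):

```python
# hypothetical per-group core usage, sampled at five points in time
group_a = [800, 0, 0, 900, 0]
group_b = [0, 850, 0, 0, 700]

# cores a single shared cluster must provide to cover both groups
combined_peak = max(a + b for a, b in zip(group_a, group_b))
# cores two dedicated clusters would need between them
separate_peaks = max(group_a) + max(group_b)
# combined_peak (900) << separate_peaks (1750): the bursts interleave nicely
```

when the bursts collide instead, the two numbers converge and the
shared cluster loses that advantage.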
- 90/95th percentile wait time for jobs in various queues. Is smaller
better, meaning the jobs don't wait long and users are happy?
wait time is kind of tricky. if you have low wait, then either the cluster
is underutilized, or it's magically rightsized (perhaps a perfectly steady,
predictable workload). once you have contention, the question is why -
is there a user who queues 10k jobs every monday? do users submit chained
(dependent) jobs, where the second is counted as waiting? do you have
fairshare turned on, or any kind of static limits or partitioning?
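for the percentile numbers in the quoted question, a nearest-rank
computation over per-job waits is enough of a sketch (the wait times
below are invented):

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (no interpolation): ordered[ceil(p/100 * N)]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

waits = [5, 10, 20, 30, 60, 120, 300, 600, 1800, 7200]  # seconds pended
nearest_rank_percentile(waits, 90)  # -> 1800
nearest_rank_percentile(waits, 95)  # -> 7200
```

note how one pathological job (the 7200 s wait) dominates the 95th
percentile - which is exactly why the "why is there contention" questions
above matter more than the number itself.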
- Availability during scheduled hours (ignoring scheduled maintenance
times). Common metric, but how do people actually measure/compute
this? What about down nodes? Some scheduled percentage (5%?) assumed
down?
I don't think it makes sense to obsess about this - yes, it's an easy number,
but it doesn't tell you much from the user's perspective.
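for what it's worth, the usual arithmetic is delivered node-hours over
scheduled node-hours, with maintenance windows excluded from the
denominator; the numbers here are invented:

```python
def availability(scheduled_node_hours, down_node_hours):
    """Fraction of scheduled node-hours actually delivered.
    Scheduled maintenance is assumed already excluded from the first arg."""
    return 1.0 - down_node_hours / scheduled_node_hours

# 100 nodes * 720 h in the month, 5 nodes down for 48 h each
availability(100 * 720, 5 * 48)  # -> ~0.9967
```

which illustrates the problem: a handful of down nodes barely moves the
number, even if those were exactly the nodes someone's job needed.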
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf