I think measuring a cluster's success by the number of jobs run
or CPUs used is a poor measure of true success. I would be more
inclined to consider a cluster a success after speaking with the people
who use it and finding out not only whether they can use it effectively,
but also what new science the cluster is enabling for them.
now try that with a large user-base ;)
I think there are two broad categories of cluster: dedicated and shared.
dedicated clusters are easy: limited number of codes, users, etc.
straightforward metrics are appropriate, such as pend time (perhaps
as a fraction of wallclock), job fail rates, fraction-of-peak measures.
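the pend-as-a-fraction-of-wallclock metric can be sketched from per-job
timestamps; the record fields here (`submit`, `start`, `end`, epoch seconds)
are made-up names for illustration, not any particular scheduler's output:

```python
def pend_ratio(job):
    """Pend time as a fraction of wallclock for one job record.
    Field names (submit/start/end, epoch seconds) are hypothetical."""
    pend = job["start"] - job["submit"]
    wallclock = job["end"] - job["start"]
    return pend / wallclock

# a job that pended 120 s, then ran for 3600 s
pend_ratio({"submit": 0, "start": 120, "end": 3720})  # -> ~0.033
```

low ratios across most jobs suggest the cluster keeps up with demand;
the caveat about contention shaping behavior (below) still applies.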
past this, things are harder and fuzzier. we try pretty hard to get
research outcomes from our users (lit citations, grants, grad student
and postdoc counts.) we try other metrics too: trying to find researchers
who get an account, generate minimal usage, then stop ("frustrated").
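that "frustrated user" query might look something like this; the thresholds
and field names are invented for illustration, not tuned recommendations:

```python
def looks_frustrated(user, max_jobs=20, idle_days=90):
    """Flag accounts with minimal usage that then went silent.
    The thresholds (20 jobs, 90 days) are arbitrary illustrations."""
    return (user["total_jobs"] < max_jobs
            and user["days_since_last_job"] > idle_days)

looks_frustrated({"total_jobs": 3, "days_since_last_job": 200})   # -> True
looks_frustrated({"total_jobs": 500, "days_since_last_job": 2})   # -> False
```

the flagged accounts are a starting list for follow-up conversations,
not a verdict - some of them just finished their project.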
for bigger, shared facilities, the simple metrics become less useful -
for instance, pend:wallclock is meaningful as long as cluster contention
doesn't "shape" user behavior. once users start reacting to contention
(by submitting fewer jobs, or maybe more), the metric's spoiled.
the only thing i find most of the below metrics really useful for is
figuring out whether or not we need a bigger cluster. though i guess
it's a little hard to imagine a case where metrics wouldn't call for
a larger cluster - does anyone really have persistently underutilized
clusters?
I also think you need to ask the "business" people what measure would
make them consider the cluster a worthwhile investment; it doesn't sound
as if you have that from your email.
my guess is that suits should be talked to about opportunity cost,
and not given a bunch of stats about utilization. that means you need
to get some info from users about what they're doing. but also
to figure out whether there's more they could do. and really, talking
to the users is important to do anyway.
Upper management is asking for us to define and provide
some sort of "numbers" which can be used to gauge the success of our
cluster project.
take a look at your cluster stats: do you have different groups with
bursty activity, but which interleaves on the cluster? that's obviously
better than multiple groups each having (probably smaller) clusters
with lower utilization over time...
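one quick check for that interleaving effect: compare the peak of the
combined load against the sum of per-group peaks (the core counts and
sample points below are invented):

```python
# hypothetical per-group core usage, sampled at five points in time
group_a = [800, 0, 0, 900, 0]
group_b = [0, 850, 0, 0, 700]

# cores a single shared cluster must provide to cover both groups
combined_peak = max(a + b for a, b in zip(group_a, group_b))
# cores two dedicated clusters would need between them
separate_peaks = max(group_a) + max(group_b)
# combined_peak (900) << separate_peaks (1750): the bursts interleave nicely
```

when the bursts collide instead, the two numbers converge and the
shared cluster loses that advantage.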
- 90/95th percentile wait time for jobs in various queues. Is smaller
better, meaning the jobs don't wait long and users are happy?
wait time is kind of tricky. if you have low wait, then either the cluster
is underutilized, or it's magically rightsized (perhaps a perfectly steady,
predictable workload). once you have contention, the question is why -
is there a user who queues 10k jobs every monday? do users submit chained
(dependent) jobs, where the second is counted as waiting? do you have
fairshare turned on, or any kind of static limits or partitioning?
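for the percentile numbers in the quoted question, a nearest-rank
computation over per-job waits is enough of a sketch (the wait times
below are invented):

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile (no interpolation): ordered[ceil(p/100 * N)]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

waits = [5, 10, 20, 30, 60, 120, 300, 600, 1800, 7200]  # seconds pended
nearest_rank_percentile(waits, 90)  # -> 1800
nearest_rank_percentile(waits, 95)  # -> 7200
```

note how one pathological job (the 7200 s wait) dominates the 95th
percentile - which is exactly why the "why is there contention" questions
above matter more than the number itself.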
- Availability during scheduled hours (ignoring scheduled maintenance
times). Common metric, but how do people actually measure/compute
this? What about down nodes? Some scheduled percentage (5%?) assumed
down?
I don't think it makes sense to obsess about this - yes, it's an easy number,
but it doesn't tell you much from the user's perspective.
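for what it's worth, the usual arithmetic is delivered node-hours over
scheduled node-hours, with maintenance windows excluded from the
denominator; the numbers here are invented:

```python
def availability(scheduled_node_hours, down_node_hours):
    """Fraction of scheduled node-hours actually delivered.
    Scheduled maintenance is assumed already excluded from the first arg."""
    return 1.0 - down_node_hours / scheduled_node_hours

# 100 nodes * 720 h in the month, 5 nodes down for 48 h each
availability(100 * 720, 5 * 48)  # -> ~0.9967
```

which illustrates the problem: a handful of down nodes barely moves the
number, even if those were exactly the nodes someone's job needed.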
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf