Re: [Beowulf] user stats on clusters

Mark Hahn Fri, 27 Feb 2009 14:31:53 -0800

> A general question: What're folks using for stats, including queue wait,
> execution times, hours/month? Any suggestions?


we run ~20 clusters, some large, and collect all the stats to a single db,
with a custom web interface, etc.  users and PIs can see tables and graphs
of usage.  we don't by default do anything with per-job pend times, though
the data is there.  we also don't do anything with hours/month - the closest
would be graphs which show ncpus across time (ie, for the past 2 weeks,
the y-axis would probably be cpu-hours-per-hour, summed over all jobs,
but possibly partitioned by user/cluster/queue/etc).
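
to make the cpu-hours-per-hour idea concrete, here is a minimal sketch, not
our actual code.  it assumes each job is reduced to a tuple of integer epoch
start/end seconds, ncpus and a grouping key (user, cluster, queue, whatever);
the function name and tuple layout are made up for illustration.

```python
from collections import defaultdict

HOUR = 3600  # default bin width in seconds


def cpu_hours_per_hour(jobs, bin_seconds=HOUR):
    """Bin job usage over time.

    jobs: iterable of (start, end, ncpus, key) with integer epoch seconds.
    Returns {(key, bin_index): average cpus in use during that bin}, which
    is the same quantity as cpu-hours per hour.
    """
    usage = defaultdict(float)
    for start, end, ncpus, key in jobs:
        if end <= start:
            continue  # skip zero-length or malformed records
        b = start // bin_seconds
        last = (end - 1) // bin_seconds
        while b <= last:
            # Portion of this job that overlaps bin b.
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            usage[(key, b)] += ncpus * (hi - lo) / bin_seconds
            b += 1
    return dict(usage)


if __name__ == "__main__":
    sample = [(0, 5400, 4, "alice"), (1800, 3600, 2, "bob")]
    for (key, b), value in sorted(cpu_hours_per_hour(sample).items()):
        print(key, b, value)
```

summing the values for one bin across all keys gives the all-jobs curve;
keeping the key gives the per-user/cluster/queue partitions.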


I don't know how much this code/etc would be of interest to anyone else.
I, at least, have not talked to other cluster people who have quite the
same take on issues.  for instance, each of our jobs is stored with
user (we have a single ldap), command, cluster, queue, flags, pend time,
seconds allocated/utime/stime.  users are either sponsor (PI) or sponsored,
and there's another level of ID intended to harmonize with a pan-Canadian
"people" database.  the current database receives job info from a variety
of schedulers - RMS on our original Alphas, LSF, my opensource minimalist
scheduler, torque/maui and SGE.  having a comprehensive DB like this has
led to some interesting optimizations having to do with shipping batches
of job records around (cron, ssh, rsync, etc), or ways of binning
usage to make it feasible to generate dynamic graphs of usage.
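
for anyone wanting to try something similar, a rough sketch of a
scheduler-neutral job record and a batch loader follows.  this is not our
schema: the table layout, the job_id field, the (cluster, job_id) key and the
use of sqlite are all assumptions for illustration.  the only point is that a
batch which gets shipped twice should load harmlessly the second time.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class JobRecord:
    cluster: str
    job_id: str         # assumed: scheduler's own id, unique within a cluster
    username: str
    command: str
    queue: str
    flags: str
    pend_seconds: int   # time spent waiting in the queue
    alloc_seconds: int  # wallclock seconds allocated
    utime: float
    stime: float


SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    cluster TEXT NOT NULL,
    job_id TEXT NOT NULL,
    username TEXT NOT NULL,
    command TEXT,
    queue TEXT,
    flags TEXT,
    pend_seconds INTEGER,
    alloc_seconds INTEGER,
    utime REAL,
    stime REAL,
    PRIMARY KEY (cluster, job_id)
)
"""


def load_batch(db, records):
    """Insert a batch of JobRecords; return how many rows were new."""
    db.execute(SCHEMA)
    before = db.total_changes
    with db:  # one transaction per batch
        db.executemany(
            "INSERT OR IGNORE INTO jobs VALUES (?,?,?,?,?,?,?,?,?,?)",
            [astuple(r) for r in records],
        )
    return db.total_changes - before


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    rec = JobRecord("c1", "42", "alice", "./a.out", "batch", "",
                    30, 7200, 7000.0, 50.0)
    print(load_batch(db, [rec]))  # first delivery of the batch
    print(load_batch(db, [rec]))  # same batch delivered again
```

each scheduler then only needs a small converter that turns its own
accounting records into JobRecord tuples before the batch is shipped.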

if you're OK with a per-cluster interface, aren't nagios and similar
packages pretty interchangeable?

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
