A general question: What're folks using for stats, including queue wait, execution times, hours/month? Any suggestions?

we run ~20 clusters, some large, and collect all the stats into a single db
with a custom web interface, etc. users and PIs can see tables and graphs
of usage. by default we don't do anything with per-job pend times, though
the data is there. we also don't do anything with hours/month - the closest
would be graphs which show ncpus across time (ie, for a graph covering the
past 2 weeks, the y-axis would probably be cpu-hours-per-hour, summed over
all jobs, but possibly partitioned by user/cluster/queue/etc).
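
to make the cpu-hours-per-hour idea concrete, here is a rough Python sketch of the binning. it is not our production code: the tuple layout (start, end, ncpus, key) and the function name are assumptions made up for illustration, and times are taken to be integer epoch seconds. each job's cpu time is spread over the hourly bins it overlaps, and the key can be a user, cluster, queue or whatever you want to partition by.

```python
from collections import defaultdict


def cpu_hours_per_bin(jobs, bin_seconds=3600):
    """Sum cpu-hours into fixed-width time bins.

    jobs: iterable of (start_epoch, end_epoch, ncpus, key) tuples
          (hypothetical layout; key is the partition, e.g. a user name).
    Returns {(bin_index, key): cpu_hours}, where bin_index is
    start_of_bin // bin_seconds.
    """
    usage = defaultdict(float)
    for start, end, ncpus, key in jobs:
        if end <= start:
            continue  # skip zero-length or malformed records
        b = start // bin_seconds
        while b * bin_seconds < end:
            # portion of this job that falls inside bin b
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            usage[(b, key)] += ncpus * (hi - lo) / 3600.0
            b += 1
    return dict(usage)
```

with hourly bins, the value in each bin is exactly the cpu-hours-per-hour figure for the y-axis; summing over all keys in a bin gives the unpartitioned total.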

I don't know how much this code/etc would be of interest to anyone else.
I, at least, have not talked to other cluster people who have quite the same
take on these issues. for instance, each of our jobs is stored with user (we
have a single ldap), command, cluster, queue, flags, pend time, and seconds
allocated/utime/stime. users are either sponsor (PI) or sponsored,
and there's another level of ID intended to harmonize with a pan-Canadian
"people" database. the current database receives job info from a variety
of schedulers - RMS on our original Alphas, LSF, my opensource minimalist
scheduler, torque/maui and SGE. having a comprehensive DB like this has led
to some interesting optimizations, such as shipping batches of job records
around (cron, ssh, rsync, etc) and binning usage to make it feasible to
generate dynamic graphs of usage.
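
for anyone wanting to try something similar, a minimal sketch of a per-job table and a queue-wait query might look like the following. this is an illustration only, using Python's sqlite3 module: the table name, column names and types are assumptions based on the fields listed above, not our actual schema, and a real setup would carry more (timestamps, sponsor IDs, etc).

```python
import sqlite3

# Hypothetical schema: one row per finished job, fields as described above.
SCHEMA = """
CREATE TABLE jobs (
    username      TEXT,
    command       TEXT,
    cluster       TEXT,
    queue         TEXT,
    flags         TEXT,
    pend_seconds  INTEGER,
    alloc_seconds INTEGER,
    utime         REAL,
    stime         REAL
)
"""


def load_jobs(conn, records):
    """Insert a batch of job records (9-tuples in column order)."""
    conn.executemany("INSERT INTO jobs VALUES (?,?,?,?,?,?,?,?,?)", records)
    conn.commit()


def mean_pend_by_queue(conn):
    """Return {(cluster, queue): mean pend time in seconds}."""
    rows = conn.execute(
        "SELECT cluster, queue, AVG(pend_seconds) FROM jobs "
        "GROUP BY cluster, queue ORDER BY cluster, queue"
    )
    return {(cluster, queue): avg for cluster, queue, avg in rows}
```

loading in batches suits the cron/ssh/rsync style of shipping records: each cluster dumps its finished jobs to a file, and the central host inserts the whole batch in one transaction.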

if you're OK with a per-cluster interface, aren't nagios and similar packages pretty interchangeable?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf