Now that there is a slurm-users mailing list, I thought I would share something I have been working on, to see if anyone else is interested in it. I have a lot of students on my cluster, and I really wanted a way to show my users how efficient their jobs are, or to let them know when they are wasting resources.
I created a few scripts that leverage Graphite and Whisper databases (RRD-like) to gather metrics from Slurm jobs running in cgroups. The resolution of the metrics is defined by the retention interval that you specify in Graphite. In my case, I can store 1-minute metrics for CPU usage and memory usage for the entire lifetime of a job.

From these databases, I have written scripts that can notify me if a user's job is wasting resources, such as requesting 64 cores when the application only scales to 8. I have also created a script that lets a user use cURL against a Grafana instance to graph their job metrics.

If anyone is interested, I wrote up a quick description at:
https://xathor.blogspot.com/2017/11/graphing-slurm-cgroup-job-metrics.html

If there's interest, I would be more than happy to polish the code a little and share it on GitHub. I am also at SC17 if anyone wants to meet up and check it out in person.

Thanks!

---
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority
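
As a rough illustration of the kind of check described above (this is not the code from the blog post), here is a minimal Python sketch. It assumes CPU time is read as a cumulative nanosecond counter, which is how the cgroup v1 cpuacct.usage file reports it, and that samples are sent using Graphite's plaintext protocol. The function names, the metric path, and the 25% threshold are all hypothetical and chosen only for the example.

```python
from statistics import mean


def graphite_line(path, value, timestamp):
    """Format one sample for Graphite's plaintext protocol.

    The protocol is "<metric path> <value> <unix timestamp>" followed
    by a newline. The metric path layout is up to the site.
    """
    return f"{path} {value} {timestamp}\n"


def cores_used(cpu_ns_start, cpu_ns_end, interval_s):
    """Average number of busy cores over one sampling interval.

    Takes two readings of a cumulative CPU-time counter in nanoseconds
    (assumed here to come from a job's cpuacct.usage file).
    """
    return (cpu_ns_end - cpu_ns_start) / (interval_s * 1e9)


def is_wasteful(samples, requested_cores, threshold=0.25):
    """Flag a job whose mean core usage is below a fraction of its request.

    `threshold` is an arbitrary example value, not a recommendation.
    """
    if not samples or requested_cores <= 0:
        return False
    return mean(samples) / requested_cores < threshold


if __name__ == "__main__":
    # A job that requested 64 cores but keeps only about 8 busy.
    used = cores_used(0, 480e9, 60)  # 480 CPU-seconds in 60 s
    print(graphite_line("slurm.job.12345.cpu_cores", used, 1510000000), end="")
    print("wasteful:", is_wasteful([used] * 10, requested_cores=64))
```

With 1-minute samples, the list passed to is_wasteful would simply be the per-minute core counts retrieved for the lifetime of the job.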