You can also use the InfluxDB profiling plugin I developed that's included in
the latest Slurm version. It provides live CPU and memory usage per task,
step, host, and job. You can then build a Grafana dashboard to display the
live metrics.
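As a rough sketch of the configuration (the parameter names are from
slurm.conf and acct_gather.conf; the host and database values below are
placeholders to adapt for your site):

    # slurm.conf
    AcctGatherProfileType=acct_gather_profile/influxdb
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=task=30

    # acct_gather.conf
    ProfileInfluxDBHost=influxdb.example.com:8086   # placeholder host
    ProfileInfluxDBDatabase=slurm_profiling         # placeholder database
    ProfileInfluxDBDefault=ALL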
Regards,
Carlos
Sent from my iPhone
Would job profiling with HDF5 work as well?
https://slurm.schedmd.com/hdf5_profile_user_guide.html
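From that guide, the workflow appears to be roughly the following (I haven't
tried it myself; the job script name and sampling interval are placeholders):

    # slurm.conf
    AcctGatherProfileType=acct_gather_profile/hdf5
    JobAcctGatherFrequency=task=30

    # submit with task profiling enabled, then merge the per-node files
    sbatch --profile=task my_job.sh
    sh5util -j <jobid>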
Jacob
On Sun, Dec 9, 2018 at 4:17 PM Sam Hawarden wrote:
Hi Aravindh
For our small 3-node cluster I've hacked together a per-node Python script
that collects current and peak CPU, memory, and scratch disk usage for all
jobs running on the cluster and builds a fairly simple web page based on it.
It shouldn't be hard to make it store those data points.
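Something along these lines (a minimal sketch, not my actual script; it
assumes psutil is installed and groups processes by their SLURM_JOB_ID
environment variable, which needs root to read for other users' processes):

    #!/usr/bin/env python3
    # Per-node collector sketch: sum CPU and RSS per Slurm job by reading
    # each process's SLURM_JOB_ID environment variable.
    import collections
    import psutil

    def job_usage():
        """Return {job_id: {"cpu_percent": float, "rss_bytes": int}}."""
        usage = collections.defaultdict(
            lambda: {"cpu_percent": 0.0, "rss_bytes": 0})
        for proc in psutil.process_iter():
            try:
                job_id = proc.environ().get("SLURM_JOB_ID")
                if job_id is None:
                    continue  # not part of a Slurm job
                # Note: the first cpu_percent() call returns 0.0; keep the
                # collector resident or sample twice for meaningful numbers.
                usage[job_id]["cpu_percent"] += proc.cpu_percent(interval=None)
                usage[job_id]["rss_bytes"] += proc.memory_info().rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # process exited or is not ours to inspect
        return dict(usage)

    if __name__ == "__main__":
        for job, stats in sorted(job_usage().items()):
            print(f"job {job}: cpu={stats['cpu_percent']:.1f}% "
                  f"rss={stats['rss_bytes'] / 2**20:.1f} MiB")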
For the simpler questions (for the overall job step, not real-time), you can
run 'sacct --format=all' to get data on completed jobs, and then:
- compare the MaxRSS column to the ReqMem column to see how far off their
memory request was
- compare the TotalCPU column to the product of the NCPUS and Elapsed columns
to see how much of their CPU allocation they actually used (see the example
after this list)
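A concrete invocation pulling just those columns (the job ID is
hypothetical):

    sacct -j 12345 --format=JobID,ReqMem,MaxRSS,NCPUS,TotalCPU,Elapsed

CPU efficiency is then roughly TotalCPU / (NCPUS * Elapsed): a job that
requested 16 CPUs for an hour but accumulated only 4 CPU-hours of TotalCPU
used 25% of its allocation.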
This is the idea behind XDMoD's SUPReMM. It does generate a ton of data,
though, so it does not scale to very active systems (i.e. ones churning
through tens of thousands of jobs).
https://github.com/ubccr/xdmod-supremm
-Paul Edmon-
On 12/9/2018 8:39 AM, Aravindh Sampathkumar wrote:
Hi All.
I was