Hi,
We have installed some new GPU nodes, and now users are asking for some
sort of monitoring of GPU utilisation and GPU memory utilisation at the
end of a job, like what Slurm already provides for CPU and memory usage.
I haven't found any documentation describing how to do GPU accounting
within Slurm, so I would like to ask the user community for advice on
best practices and any available (simple) tools.
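By the accounting Slurm already provides for CPU and memory I mean the
sort of information sacct reports after a job has finished, for example
(the choice of format fields here is just an illustration):
$ sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,MaxVMSize
and I am hoping for something similarly simple for GPU usage.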
What I have discovered is that NVIDIA provides per-process GPU accounting
through nvidia-smi [1]. It is enabled with
$ nvidia-smi --accounting-mode=1
and queried with
$ nvidia-smi \
    --query-accounted-apps=gpu_name,pid,time,gpu_util,mem_util,max_memory_usage \
    --format=csv
but the documentation seems quite scant, and so far I don't see any output
from this query command.
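If I read the nvidia-smi documentation correctly, accounting mode can
only be changed by root, and only processes started after it has been
enabled get recorded, which might explain the empty output. So a minimal
test on a node would presumably look something like this (untested
sketch; ./my_gpu_app stands for any GPU application):
$ sudo nvidia-smi --accounting-mode=1    # must be done as root
$ ./my_gpu_app                           # run a GPU workload to completion
$ nvidia-smi \
    --query-accounted-apps=gpu_name,pid,time,gpu_util,mem_util,max_memory_usage \
    --format=csv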
Some questions:
1. Is there a way to integrate NVIDIA's process accounting into Slurm?
2. Can users run the above query command in their job scripts and get the
GPU accounting information for their job? (A rough sketch of what I have
in mind follows below the questions.)
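Regarding question 2, what I have in mind is roughly the following at the
end of a batch script (just a sketch; it assumes accounting mode is
already enabled on the node, and my_gpu_app is only a placeholder):
#!/bin/bash
#SBATCH --gres=gpu:1
# Run the GPU application
srun ./my_gpu_app
# Afterwards, print the per-process GPU accounting data from this node
nvidia-smi \
    --query-accounted-apps=gpu_name,pid,time,gpu_util,mem_util,max_memory_usage \
    --format=csv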
Thanks,
Ole
References:
1. https://developer.nvidia.com/nvidia-system-management-interface