Hi,

We have installed some new GPU nodes, and users are now asking for a summary of GPU utilisation and GPU memory utilisation at the end of a job, similar to what Slurm already provides for CPU and memory usage.

I haven't found any pages describing how to perform GPU accounting within Slurm, so I would like to ask the user community for advice on best practices and any (simple) tools that are available.

What I have discovered is that NVIDIA provides per-process accounting through nvidia-smi [1]. It is enabled with

$ nvidia-smi --accounting-mode=1

and queried with

$ nvidia-smi --query-accounted-apps=gpu_name,pid,time,gpu_utilization,mem_utilization,max_memory_usage --format=csv

but the documentation seems quite scant, and so far I don't see any output from this query command.
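
In case the details matter, here is a minimal sketch of what I think needs to be run on each GPU node (as root) to enable accounting:

$ nvidia-smi --persistence-mode=1    # keep the driver loaded between jobs
$ nvidia-smi --accounting-mode=1     # enable per-process GPU accounting
$ nvidia-smi -q -d ACCOUNTING        # verify that Accounting Mode shows Enabled

The persistence-mode step is just my assumption, to keep the accounting setting from being reset when the driver unloads. I also assume that only processes started after accounting was enabled get recorded, which might explain the empty output I am seeing.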

Some questions:

1. Is there a way to integrate NVIDIA's process accounting into Slurm's job accounting?

2. Can users run the above query command in their job scripts and get the GPU accounting information for their job? (A sketch of what I have in mind follows below.)
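
For question 2, what I have in mind is roughly the following batch script (my_gpu_app is just a placeholder, and this assumes accounting mode has already been enabled by root on the node):

#!/bin/bash
#SBATCH --job-name=gputest
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Run the actual GPU application (placeholder name)
./my_gpu_app

# At the end of the job, print the per-process GPU accounting data
nvidia-smi --query-accounted-apps=gpu_name,pid,time,gpu_utilization,mem_utilization,max_memory_usage --format=csv

One concern is that the accounted-apps buffer persists across jobs on the node, so the output may include processes from earlier jobs unless root clears it (I believe nvidia-smi --clear-accounted-apps does this) or the output is filtered on the job's own PIDs.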

Thanks,
Ole

References:
1. https://developer.nvidia.com/nvidia-system-management-interface
