[slurm-users] GPU utilization of running jobs

2022-10-19 Thread Vecerka Daniel

Hi,

 we want to push our users to run jobs with high GPU utilization.
Because it's difficult for users to get the GPU utilization of their
jobs, I have decided to write a script which prints the utilization of
running jobs. The idea is simple:


 1. get the list of running jobs in the GPU partitions
 2. get the IDs of the allocated GPUs for each job from step 1
    (scontrol show job=$job_id -d)
 3. get, via the Prometheus API, the utilization of the GPUs allocated
    in step 2 over the period the job has been running.

https://github.com/NVIDIA/dcgm-exporter is needed for step 3; a rough
sketch of the whole thing follows below.
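
A minimal sketch of the idea in bash (the Prometheus endpoint, the
dcgm-exporter label names Hostname/gpu and the IDX parsing are
assumptions and will need adjusting per site; it also assumes
single-node jobs and a comma-separated IDX list):

#!/bin/bash
# Rough sketch only -- adjust partition names, the Prometheus URL and
# the dcgm-exporter label names to your site.
PROM=http://prometheus.example.org:9090   # hypothetical endpoint

# 1. running jobs (and their node) in the GPU partition(s)
squeue -h -t RUNNING -p gpu -o '%i %N' | while read -r job node; do
    # 2. allocated GPU indices, e.g. "GRES=gpu:a100:2(IDX:0,3)" -> "0,3"
    #    (ranges like IDX:0-3 would need to be expanded first)
    idx=$(scontrol show job="$job" -d | grep -oP 'IDX:\K[0-9,]+' | head -1)
    [ -z "$idx" ] && continue

    # 3. average utilization of those GPUs (here simply over the last
    #    hour) from the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL
    query="avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{Hostname=\"$node\",gpu=~\"${idx//,/|}\"}[1h]))"
    util=$(curl -sG "$PROM/api/v1/query" --data-urlencode "query=$query" \
             | jq -r '.data.result[0].value[1]')
    echo "job=$job node=$node gpus=$idx util=${util}%"
done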

  It works fine for our Intel nodes with 4 V100 GPUs, but on our AMD
nodes with 4 or 8 A100 GPUs there is a problem: the IDs of the
allocated GPUs printed by scontrol show job=$job_id -d don't correspond
to the IDs used by the NVIDIA DCGM Exporter and nvidia-smi, i.e. by the
NVIDIA NVML library. GPU ID 1 in Slurm is ID 0 for NVML; the mapping is
1 -> 0, 2 -> 3, 3 -> 2 on the 4-GPU nodes and 0 -> 2, 1 -> 3, 2 -> 0,
3 -> 1, 4 -> 6, 5 -> 7, 6 -> 4, 7 -> 5 on the 8-GPU nodes.
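
Rather than hard-coding that permutation, the mapping could perhaps be
derived per node with nvidia-smi (a sketch; it assumes the minor_number
query field, which should correspond to /dev/nvidiaN, i.e. the File=
lines Slurm uses in gres.conf):

# Map the /dev/nvidiaN minor number (Slurm's view) to the NVML/DCGM
# index used by nvidia-smi and dcgm-exporter.
nvidia-smi --query-gpu=minor_number,index --format=csv,noheader,nounits \
  | while IFS=', ' read -r minor nvml; do
        echo "slurm/device $minor -> nvml/dcgm $nvml"
    done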


We are using Slurm 20.11.7 and gres.conf on the Intel nodes is:

AutoDetect=nvml
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=gpu Type=v100 File=/dev/nvidia2
Name=gpu Type=v100 File=/dev/nvidia3

On the AMD nodes:

AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1
Name=gpu Type=a100 File=/dev/nvidia2
Name=gpu Type=a100 File=/dev/nvidia3

It isn't a problem to hack the script to convert the IDs on the AMD
nodes, so the script works fine for all our nodes, but I would like to
publish the script on GitLab and make it as universal as possible. My
question is: do you know why Slurm sometimes uses different GPU IDs
than the NVIDIA NVML library?


Another question: do you know how to store the IDs of the used GPUs in
the Slurm database, so we can also get the GPU utilization of completed
jobs?


We have in slurm.conf
AccountingStorageTRES=cpu,mem,gres/gpu

and the only information stored is the number of allocated GPUs.
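
For illustration (the job ID and values here are made up), AllocTRES
records only the count, not which devices were used:

sacct -j <jobid> -X -Pno AllocTRES
billing=36,cpu=36,gres/gpu=4,mem=180G,node=1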

Thanks in advance,  Daniel Vecerka, CTU in Prague




Re: [slurm-users] Usage gathering for GPUs

2023-06-06 Thread Vecerka Daniel

Hi all,

 I'm trying to get the gathering of gres/gpumem and gres/gpuutil
working on Slurm 23.02.2, but with no success yet.


We have:
AccountingStorageTRES=cpu,mem,gres/gpu
in slurm.conf, and Slurm is built with NVML support.

Autodetect=NVML
in gres.conf
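
To double-check that the NVML GPU plugin is actually there (the plugin
path is just an example, PluginDir may differ, see scontrol show config):

ls /usr/lib64/slurm/gpu_nvml.so
ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml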

gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve
record, but with zero values:


sacct -j 6056927_51 -Pno TRESUsageInAve

cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 
530.30.02. dcgm-exporter is working on the nodes.


Is there anything else needed to get this working?

Thanks in advance.    Daniel Vecerka


On 24. 05. 23 21:45, Christopher Samuel wrote:

On 5/24/23 11:39 am, Fulton, Ben wrote:


Hi,


Hi Ben,

The release notes for 23.02 say “Added usage gathering for gpu/nvml 
(Nvidia) and gpu/rsmi (AMD) plugins”.


How would I go about enabling this?


I can only comment on the NVIDIA side (as those are the GPUs we have),
but for that you need Slurm built with NVML support and running with
"Autodetect=NVML" in gres.conf; that information is then stored in
slurmdbd as part of the TRES usage data.


For example to grab a job step for a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu

gres/gpumem=493120K
gres/gpuutil=76

Hope that helps!

All the best,
Chris

