[slurm-users] GPU utilization of running jobs
Hi,

we want to push our users to run jobs with high GPU utilization. Because it is difficult for users to get the GPU utilization of their jobs, I have decided to write a script which prints the utilization of running jobs. The idea is simple:

1. get the list of running jobs in the GPU partitions
2. get the IDs of the GPUs allocated to each job from step 1 (scontrol show job=$job_id -d)
3. query the Prometheus API for the utilization of the GPUs from step 2 over the period the job has been running

https://github.com/NVIDIA/dcgm-exporter is needed for step 3. (A minimal sketch of the script is appended below, after my sign-off.)

It works fine for our Intel nodes with 4 V100 GPUs, but on our AMD nodes with 4 or 8 A100 GPUs there is a problem: the IDs of the allocated GPUs printed by "scontrol show job=$job_id -d" do not correspond to the IDs used by the NVIDIA DCGM Exporter and nvidia-smi, i.e. by the NVIDIA NVML library. GPU ID 1 in Slurm is ID 0 for NVML; the mapping is 1->0, 2->3, 3->2 on the 4-GPU nodes and 0->2, 1->3, 2->0, 3->1, 4->6, 5->7, 6->4, 7->5 on the 8-GPU nodes.

We are using Slurm 20.11.7. gres.conf on the Intel nodes is:

AutoDetect=nvml
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=gpu Type=v100 File=/dev/nvidia2
Name=gpu Type=v100 File=/dev/nvidia3

and on the AMD nodes:

AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1
Name=gpu Type=a100 File=/dev/nvidia2
Name=gpu Type=a100 File=/dev/nvidia3

It is not a problem to hack the script to convert the IDs on the AMD nodes so that it works on all our nodes, but I would like to publish the script on GitLab and make it as universal as possible. My question is: do you know why Slurm sometimes uses different GPU IDs than the NVIDIA NVML library?

Another question: do you know how to store the IDs of the used GPUs in Slurmdb, so that we can get the GPU utilization of completed jobs? We have AccountingStorageTRES=cpu,mem,gres/gpu in slurm.conf, and the only information stored is the number of allocated GPUs.

Thanks in advance,

Daniel Vecerka, CTU in Prague
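P.S. A minimal sketch of the script, with the assumptions flagged: the Prometheus URL, the partition name, the one-hour averaging window and the (empty) remap table are site-specific placeholders; the dcgm-exporter label names ("Hostname", "gpu") may differ between exporter versions; curl and jq are assumed to be available; and GPU index ranges such as "IDX:0-3" are not expanded here.

#!/bin/bash
# Sketch: print the average GPU utilization of running jobs in a GPU partition.
PROM="http://prometheus.example.org:9090"   # placeholder Prometheus URL
PARTITION="gpu"                             # placeholder partition name
WINDOW="1h"                                 # simplification: only look at the last hour

# Slurm-index -> NVML-index remap, to be filled in per node type.
# Example for our 8-GPU AMD nodes: ( [0]=2 [1]=3 [2]=0 [3]=1 [4]=6 [5]=7 [6]=4 [7]=5 )
declare -A REMAP=()

# 1. running jobs in the GPU partition
for job in $(squeue -h -t RUNNING -p "$PARTITION" -o '%i'); do
    # 2. allocated GPU indices per node, from detail lines like
    #    "Nodes=node01 CPU_IDs=0-15 Mem=... GRES=gpu:a100:2(IDX:0,1)"
    scontrol show job="$job" -d | grep -oP 'Nodes=\S+.*IDX:[0-9,-]+' | while read -r line; do
        node=$(grep -oP 'Nodes=\K\S+' <<< "$line")
        idxs=$(grep -oP 'IDX:\K[0-9,-]+' <<< "$line" | tr ',' ' ')
        for idx in $idxs; do
            gpu=${REMAP[$idx]:-$idx}        # fall back to the Slurm index if no remap
            # 3. average utilization of that GPU over the window, via the Prometheus API
            util=$(curl -s "$PROM/api/v1/query" \
                --data-urlencode "query=avg_over_time(DCGM_FI_DEV_GPU_UTIL{Hostname=\"$node\",gpu=\"$gpu\"}[$WINDOW])" \
                | jq -r '.data.result[0].value[1] // "n/a"')
            printf '%s %s gpu%s %s%%\n' "$job" "$node" "$gpu" "$util"
        done
    done
done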
Re: [slurm-users] Usage gathering for GPUs
Hi all,

I am trying to get the gathering of gres/gpumem and gres/gpuutil working on Slurm 23.02.2, but with no success yet. We have AccountingStorageTRES=cpu,mem,gres/gpu in slurm.conf, Slurm is built with NVML support and gres.conf contains Autodetect=NVML. gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve record, but with zero values:

sacct -j 6056927_51 -Pno TRESUsageInAve
cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 530.30.02, and dcgm-exporter is working on the nodes. Is there anything else needed to get this working? (A small verification sketch is appended after the quoted thread below.)

Thanks in advance,

Daniel Vecerka

On 24. 05. 23 21:45, Christopher Samuel wrote:
> On 5/24/23 11:39 am, Fulton, Ben wrote:
>
>> Hi,
>
> Hi Ben,
>
>> The release notes for 23.02 say "Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins". How would I go about enabling this?
>
> I can only comment on the NVIDIA side (as those are the GPUs we have), but for that you need Slurm built with NVML support and running with "Autodetect=NVML" in gres.conf, and then that information is stored in slurmdbd as part of the TRES usage data.
>
> For example, to grab a job step for a test code I ran the other day:
>
> csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
> gres/gpumem=493120K
> gres/gpuutil=76
>
> Hope that helps!
>
> All the best,
> Chris
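For reference, a rough sketch of how one might verify the NVML plugin on a compute node and extract the per-step GPU numbers with sacct once they stop being zero. The plugin directory (/usr/lib64/slurm) is an assumed path for a typical RPM install, the job ID is a placeholder taken from the example above, and slurmd -G may need to be run as root on the node.

#!/bin/bash
# Rough verification + extraction sketch; /usr/lib64/slurm is an assumed plugin directory.

# Is the NVML GPU plugin installed, and does it link against libnvidia-ml?
ls -l /usr/lib64/slurm/gpu_nvml.so
ldd /usr/lib64/slurm/gpu_nvml.so | grep -i nvidia-ml

# What GRES does slurmd autodetect on this node?
slurmd -G

# Per-step GPU usage of a completed job (the same data shown above), one line per step:
job="6056927_51"    # placeholder job ID
sacct -j "$job" -Pno JobID,TRESUsageInAve | awk -F'|' '$2 != "" {
    n = split($2, a, ",")
    mem = "?"; util = "?"
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /^gres\/gpumem=/)  { sub(/^gres\/gpumem=/,  "", a[i]); mem  = a[i] }
        if (a[i] ~ /^gres\/gpuutil=/) { sub(/^gres\/gpuutil=/, "", a[i]); util = a[i] }
    }
    printf "%s  gpumem=%s  gpuutil=%s\n", $1, mem, util
}'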