Hi Kota,

thanks for the hint.

Still, I'm a little astonished, because if I remember right, CUDA_VISIBLE_DEVICES inside a cgroup always starts from zero. But that was years ago, back when we were still using LSF.

But SLURM_JOB_GPUS seems to be the right thing:

Same node, two different users (and therefore two different jobs):


$> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
SLURM_JOB_GPUS=0
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0

$> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
SLURM_MEM_PER_GPU=6144
SLURM_JOB_GPUS=1
CUDA_VISIBLE_DEVICES=0
GPU_DEVICE_ORDINAL=0
CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
CUDA_VERSION=101

SLURM_JOB_GPUS differs:

$> scontrol show -d job 14658274
...
Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)

$> scontrol show -d job 14673550
...
Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)



Is there anyone out there who can confirm this besides me?
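
In case it helps anyone to cross-check: here is a rough sketch one could run on a shared GPU node (untested in this exact form; it assumes a reasonably recent nvidia-smi that supports --query-gpu / --query-compute-apps, a bash with associative arrays, and root access to read the other users' /proc/<pid>/environ):

#!/bin/bash
# Map each GPU UUID to its physical index as reported by nvidia-smi.
declare -A idx_by_uuid
while IFS=', ' read -r idx uuid; do
    idx_by_uuid["$uuid"]=$idx
done < <(nvidia-smi --query-gpu=index,uuid --format=csv,noheader)

# For every compute process, print its physical GPU index next to the
# SLURM_JOB_GPUS value from its environment, so both can be compared directly.
while IFS=', ' read -r pid uuid; do
    env_gpus=$(xargs --null --max-args=1 echo < "/proc/$pid/environ" | grep '^SLURM_JOB_GPUS=')
    echo "pid=$pid nvidia_smi_index=${idx_by_uuid[$uuid]} $env_gpus"
done < <(nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader)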


Best
Marcus


On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
If I remember right, when you use cgroups, CUDA_VISIBLE_DEVICES always
starts from zero. So this is NOT the index of the GPU.

Thanks. Just FYI, when I tested the environment variables with Slurm 19.05.2 and a 
proctrack/cgroup configuration, it looked like CUDA_VISIBLE_DEVICES matched the indices 
of the host devices (i.e. it did not start from zero). I'm not sure whether the behavior 
has changed in newer Slurm versions, though.

I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL were set in the environment, 
which could be useful. In my current tests, those variables had the same values as 
CUDA_VISIBLE_DEVICES.

Any advice on what I should look for is always welcome.

Best,
Kota

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus 
Wagner
Sent: Tuesday, June 16, 2020 9:17 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?

Hi David,

If I remember right, when you use cgroups, CUDA_VISIBLE_DEVICES always
starts from zero. So this is NOT the index of the GPU.

Just verified it:
$> nvidia-smi
Tue Jun 16 13:28:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17269      C   gmx_mpi                                      679MiB |
|    1     19246      C   gmx_mpi                                      513MiB |
+-----------------------------------------------------------------------------+

$> squeue -w nrg04
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
            14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04


$> scontrol show job -d 14560005
...
     Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
       Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)

$> scontrol show job -d 14560009
JobId=14560009 JobName=egf5
...
     Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
       Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)

From the PIDs in the nvidia-smi output:

$> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0

$> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0


So this is only a way to see how MANY devices were used, not which.


Best
Marcus

On 10.06.2020 at 20:49, David Braun wrote:
Hi Kota,

This is from the job template that I give to my users:

# Collect some information about the execution environment that may
# be useful should we need to do some debugging.

echo "CREATING DEBUG DIRECTORY"
echo

mkdir .debug_info
module list > .debug_info/environ_modules 2>&1
ulimit -a > .debug_info/limits 2>&1
hostname > .debug_info/environ_hostname 2>&1
env |grep SLURM > .debug_info/environ_slurm 2>&1
env |grep OMP |grep -v OMPI > .debug_info/environ_omp 2>&1
env |grep OMPI > .debug_info/environ_openmpi 2>&1
env > .debug_info/environ 2>&1

if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
          echo "SAVING CUDA ENVIRONMENT"
          echo
          env |grep CUDA > .debug_info/environ_cuda 2>&1
fi

You could add something like this to one of the SLURM prologs to save
the GPU list of jobs.
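
For example, a minimal (untested) sketch of such a prolog snippet; the log directory is made up, and exactly which variables are visible depends on which prolog you hook it into (SLURM_JOB_GPUS is documented for the Prolog/Epilog environment, while CUDA_VISIBLE_DEVICES is only present in the task environment):

#!/bin/bash
# Hypothetical prolog snippet: append the GPUs assigned to each job to a
# node-local log, so the indices can still be looked up after the job ends.
LOGDIR=/var/log/slurm/gpu_usage   # assumed location, adjust as needed
mkdir -p "$LOGDIR"
echo "date=$(date -Is) node=$(hostname -s) job=$SLURM_JOB_ID user=$SLURM_JOB_USER gpus=$SLURM_JOB_GPUS" \
    >> "$LOGDIR/gpu_jobs.log"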

Best,

David

On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
<kota.tsuyuzaki...@hco.ntt.co.jp
<mailto:kota.tsuyuzaki...@hco.ntt.co.jp>> wrote:

     Hello Guys,

     We are running GPU clusters with Slurm and SlurmDBD (19.05 series), and
     some of the GPUs seem to be causing trouble for the jobs attached to
     them. To investigate whether the troubles happened on the same GPUs, I'd
     like to get the GPU indices of completed jobs.

     In my understanding, `scontrol show job` can show the indices (as IDX
     in the gres info) but cannot be used for completed jobs. `sacct -j`
     works for completed jobs but won't print the indices.

     Is there any way (commands, configurations, etc...) to see the
     allocated GPU indices for completed jobs?

     Best regards,

     --------------------------------------------
     Kota Tsuyuzaki (露崎 浩太)
     kota.tsuyuzaki...@hco.ntt.co.jp <mailto:kota.tsuyuzaki...@hco.ntt.co.jp>
     NTT Software Innovation Center
     Distributed Processing Platform Technology Project
     0422-59-2837
     ---------------------------------------------






--
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social Media Kanäle des IT Centers:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
