Hi David,

if I remember correctly, CUDA_VISIBLE_DEVICES always starts from zero when cgroups are used. So this is NOT the physical index of the GPU.
Just verified it:

$> nvidia-smi
Tue Jun 16 13:28:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17269      C   gmx_mpi                                      679MiB |
|    1     19246      C   gmx_mpi                                      513MiB |
+-----------------------------------------------------------------------------+

$> squeue -w nrg04
   JOBID PARTITION NAME  USER     ST TIME       NODES NODELIST(REASON)
14560009 c18g_low  egf5  bk449967  R 1-00:17:48     1 nrg04
14560005 c18g_low  egf1  bk449967  R 1-00:20:23     1 nrg04

$> scontrol show job -d 14560005
...
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
     Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)

$> scontrol show job -d 14560009
JobId=14560009 JobName=egf5
...
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
     Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)

Using the PIDs from the nvidia-smi output:

$> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0
$> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=0

So CUDA_VISIBLE_DEVICES only tells you how MANY devices a job used, not which ones.

Best
Marcus

On 10.06.2020 at 20:49, David Braun wrote:
Hi Kota,

This is from the job template that I give to my users:

# Collect some information about the execution environment that may
# be useful should we need to do some debugging.
echo "CREATING DEBUG DIRECTORY"
echo

mkdir .debug_info
module list > .debug_info/environ_modules 2>&1
ulimit -a > .debug_info/limits 2>&1
hostname > .debug_info/environ_hostname 2>&1
env | grep SLURM > .debug_info/environ_slurm 2>&1
env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
env | grep OMPI > .debug_info/environ_openmpi 2>&1
env > .debug_info/environ 2>&1

if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
    echo "SAVING CUDA ENVIRONMENT"
    echo
    env | grep CUDA > .debug_info/environ_cuda 2>&1
fi

You could add something like this to one of the SLURM prologs to save the GPU list of jobs.

Best,

David

On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:

Hello Guys,

We are running GPU clusters with Slurm and SlurmDBD (version 19.05 series), and some of the GPUs seem to run into trouble with the jobs attached to them. To investigate whether the trouble keeps happening on the same GPUs, I'd like to get the GPU indices of completed jobs. As far as I understand, `scontrol show job` can show the indices (as IDX in the gres info) but cannot be used for completed jobs, while `sacct -j` works for completed jobs but won't print the indices.

Is there any way (commands, configurations, etc.) to see the allocated GPU indices for completed jobs?

Best regards,

--------------------------------------------
露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki...@hco.ntt.co.jp
NTT Software Innovation Center
Distributed Processing Platform Technology Project
0422-59-2837
---------------------------------------------
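For reference, a minimal, untested sketch of the prolog approach David mentions, kept close to the commands shown earlier in this thread: it runs scontrol show job -d while the job is still known to the controller and appends the GRES_IDX line to a log, so the indices can still be looked up after the job has completed. The log path is only an example, and the exact GRES field name can differ between Slurm versions.

#!/bin/bash
# Sketch for a node Prolog (or Epilog): record the GPU indices that the
# controller reports for this job, so they survive job completion.
# The log location below is an arbitrary example.
LOG=/var/log/slurm/gpu_allocations.log

{
    printf '%s job=%s node=%s ' "$(date -Is)" "$SLURM_JOB_ID" "$(hostname -s)"
    # On 19.05 the detailed view prints e.g. "GRES_IDX=gpu(IDX:0)", as shown above.
    scontrol show job -d "$SLURM_JOB_ID" | grep -o 'GRES_IDX=[^ ]*' | tr '\n' ' '
    echo
} >> "$LOG"

Together with sacct, such a log would give the job-to-GPU-index mapping for completed jobs that Kota is asking about.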
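And since, as shown above, CUDA_VISIBLE_DEVICES is renumbered from zero inside a cgroup-constrained job, a small addition to the .debug_info block of the template could record identifiers that do name the physical card: the GPU UUID and PCI bus ID reported by nvidia-smi are stable regardless of that renumbering. A possible (untested) addition, with file names simply following the template's pattern:

if command -v nvidia-smi > /dev/null 2>&1; then
    echo "SAVING GPU IDENTIFIERS"
    echo
    # List the GPUs this job can actually see; the UUIDs identify the physical cards.
    nvidia-smi -L > .debug_info/gpus 2>&1
    # Same information in CSV form: enumeration index, UUID and PCI bus ID.
    nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv > .debug_info/gpus_csv 2>&1
fi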
--
Dipl.-Inf. Marcus Wagner

IT Center
Group: Linux Systems Group
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ