I only get a line returned for “Gres=”, but I see the same behavior on another cluster that does have GPUs, and on that cluster the variable does get set.

-Sajesh-

--
_____________________________________________________
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
200 Central Park West
New York, NY 10024
(O) (212) 313-7263  (C) (917) 763-9038
(E) ssi...@amnh.org
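For comparison, on a node where GPU gres is fully configured, that check typically returns something along these lines (node name, counts, and TRES values here are illustrative, not taken from this thread):

    $ scontrol show node gpunode01 | grep -i gres
       Gres=gpu:2
       CfgTRES=cpu=16,mem=128G,billing=16,gres/gpu=2

Whether gres/gpu appears in CfgTRES also depends on it being listed in AccountingStorageTRES, so a missing CfgTRES match on its own is only a hint.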
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: Thursday, October 8, 2020 4:53 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] CUDA environment variable not being set

From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine returns lines for both “Gres=” and “CfgTRES=”.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sajesh Singh <ssi...@amnh.org>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] CUDA environment variable not being set

It seems as though the modules are loaded, as when I run lsmod I get the following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also, the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--
-SS-

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Relu Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed.

Relu
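Given that lsmod and nvidia-smi above look healthy, a quick way to narrow this down is to confirm the device files exist on the node and to check whether a GPU job actually receives the variable. A minimal sketch (the exact partition/node options will vary by site):

    # on the GPU node: the device files slurmd would map into CUDA_VISIBLE_DEVICES
    ls -l /dev/nvidia*

    # from a submit host: request one GPU and print the variable inside the job
    srun --gres=gpu:1 printenv CUDA_VISIBLE_DEVICES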
On 2020-10-08 14:57, Sajesh Singh wrote:

Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node, which are defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following message in the slurmd.log file:

debug: common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS
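That particular slurmd message generally means the gres.conf entries for the node do not name the GPU device files, so slurmd has nothing to translate into CUDA_VISIBLE_DEVICES even though the GPUs themselves work. A minimal sketch of a gres.conf that does list the device files (node name and type label are assumed, not taken from this thread):

    # gres.conf on the GPU node (illustrative)
    NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia0
    NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia1

paired with a matching Gres=gpu:2 (or Gres=gpu:m5000:2) on that node's line in slurm.conf.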