Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Andrey Malyutin
Thank you for your help, Sam! The rest of the slurm.conf, excluding the node and partition configuration from the earlier email, is below. I've also included scontrol output for a 1 GPU job that runs successfully on node01. Best, Andrey *Slurm.conf* # # See the slurm.conf man page for more infor…
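For reference (not Andrey's actual file, which is truncated above), the GPU-relevant parts of a slurm.conf for a cluster like this usually look something like the following; node names, GPU counts, and CPU/memory figures here are illustrative:

    GresTypes=gpu
    AccountingStorageTRES=gres/gpu
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    # one Gres=gpu:N entry per GPU node; the hardware values below are made up
    NodeName=node[01-03] CPUs=48 RealMemory=384000 Gres=gpu:4 State=UNKNOWN
    PartitionName=gpu Nodes=node[01-03] Default=YES MaxTime=INFINITE State=UP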

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
...and I'm not sure what "AutoDetect=NVML" is supposed to do in the gres.conf file. We've always used "nvidia-smi topo -m" to confirm that we've got a single-root or dual-root node and have entered the correct info in gres.conf to map connections to the CPU sockets, e.g.: # 8-gpu A6000 nodes -…
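Samuel's example is cut off in the preview above. A gres.conf that maps GPUs to their local CPU socket by hand generally looks along these lines (device files, core ranges, and node names are illustrative, not the actual A6000 config):

    # dual-root node: two GPUs per socket, cores 0-23 on socket 0, 24-47 on socket 1
    NodeName=node[01-02] Name=gpu Type=rtx6000 File=/dev/nvidia[0-1] Cores=0-23
    NodeName=node[01-02] Name=gpu Type=rtx6000 File=/dev/nvidia[2-3] Cores=24-47

By contrast, a gres.conf containing only "AutoDetect=nvml" asks slurmd to discover the GPUs, their device files, and their core affinity itself via NVML, which only works if slurmd was built against the NVIDIA management library.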

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
Well... you've got lots of weirdness, as the scontrol show job command isn't listing any GPU TRES requests, and the scontrol show node command isn't listing any configured GPU TRES resources. If you send me your entire slurm.conf I'll have a quick look-over. You should also be using cgroup.conf t…
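The cgroup advice is truncated above; the usual idea is to have Slurm hide unallocated GPUs from jobs via the devices cgroup. A minimal sketch of the relevant settings (values illustrative):

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes    # jobs only see the GPUs they were allocated

    # and in slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity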

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Andrey Malyutin
Thank you, Samuel. The Slurm version is 20.02.6. I'm not entirely sure about the platform; the RTX6000 nodes are about 2 years old, and the 3090 node is very recent. Technically we have 4 nodes (hence references to node04 in the info below), but one of the nodes is down and out of the system at the moment. As you…

Re: [slurm-users] GPU jobs not running correctly

2021-08-19 Thread Fulcomer, Samuel
What SLURM version are you running? What are the #SBATCH directives in the batch script? (or the sbatch arguments) When the single-GPU jobs are pending, what's the output of 'scontrol show job JOBID'? What are the node definitions in slurm.conf, and the lines in gres.conf? Are the nodes all the…
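For concreteness, a minimal single-GPU batch script and the diagnostic command being asked about might look like this (the script contents are illustrative, not Andrey's actual job):

    #!/bin/bash
    #SBATCH --job-name=gputest
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    srun nvidia-smi -L

    # while the job sits in PD (pending):
    scontrol show job <JOBID>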

[slurm-users] GPU jobs not running correctly

2021-08-19 Thread Andrey Malyutin
Hello, We are in the process of finishing up the setup of a cluster with 3 nodes, 4 GPUs each. One node has RTX3090s and the other two have RTX6000s. Any job asking for 1 GPU in the submission script will wait to run on the 3090 node, regardless of resource availability. The same job requesting 2 or more GPU…
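A quick first check for this kind of scheduling behaviour is whether Slurm actually knows about the GPUs on every node, which is the gap Samuel points to further up this page (no GPU TRES in the scontrol output). For example (node name illustrative):

    sinfo -N -o "%N %G %t"                         # configured Gres per node
    scontrol show node node01 | grep -iE 'gres|tres'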