Thank you for your help, Sam! The rest of the slurm.conf, excluding the
node and partition configuration from the earlier email, is below. I've also
included scontrol output for a 1-GPU job that runs successfully on node01.
Best,
Andrey
*Slurm.conf*
#
# See the slurm.conf man page for more information.
...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
gres.conf file. We've always used "nvidia-smi topo -m" to confirm whether
we've got a single-root or dual-root node, and have entered the correct info
in gres.conf to map GPU connections to the CPU sockets, e.g.:
# 8-gpu A6000 nodes -
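For reference, a hand-mapped gres.conf of the kind described above might look
roughly like the sketch below; the node name, device files, and core ranges are
illustrative assumptions (a 48-core, dual-root, 8-GPU A6000 box), not this
cluster's actual values:

# hypothetical dual-root node: GPUs 0-3 hang off socket 0, GPUs 4-7 off socket 1
NodeName=gpu-a6000-01 Name=gpu Type=a6000 File=/dev/nvidia[0-3] Cores=0-23
NodeName=gpu-a6000-01 Name=gpu Type=a6000 File=/dev/nvidia[4-7] Cores=24-47
# The alternative, AutoDetect=nvml, lets slurmd query the GPUs and their CPU
# affinity itself via the NVIDIA management library (Slurm must be built with
# NVML support for this to work).

Either approach ends up in the same place; the hand-written form just has to
match what "nvidia-smi topo -m" reports, while AutoDetect=nvml reads that
topology directly.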
Well... you've got lots of weirdness, as the scontrol show job command
isn't listing any GPU TRES requests, and the scontrol show node command
isn't listing any configured GPU TRES resources.
If you send me your entire slurm.conf I'll have a quick look-over.
You also should be using cgroup.conf t
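(For reference, a typical cgroup.conf on GPU nodes looks roughly like the
sketch below; the values are generic assumptions, not taken from this cluster.
ConstrainDevices=yes is the piece that actually fences a job off from GPUs it
did not request, and it relies on the File= paths in gres.conf. It also assumes
slurm.conf sets ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup.)

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes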
Thank you Samuel,
The Slurm version is 20.02.6. I'm not entirely sure about the platform; the
RTX6000 nodes are about 2 years old, and the 3090 node is very recent.
Technically we have 4 nodes (hence the references to node04 in the info below),
but one of the nodes is down and out of the system at the moment. As you
What SLURM version are you running?
What are the #SBATCH directives in the batch script? (or the sbatch
arguments)
When the single GPU jobs are pending, what's the output of 'scontrol show
job JOBID'?
What are the node definitions in slurm.conf, and the lines in gres.conf?
Are the nodes all the
Hello,
We are finishing up the setup of a cluster with 3 nodes, 4 GPUs each. One node
has RTX3090s and the other 2 have RTX6000s. Any job asking for 1 GPU in the
submission script will wait to run on the 3090 node, regardless of resource
availability. Same job requesting 2 or more GPU
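For context, a single-GPU submission of the kind described above would
typically look like the sketch below; the job name, partition name, CPU count,
and time limit are assumptions, not taken from the thread:

#!/bin/bash
# illustrative job name and an assumed partition name
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
# ask for one GPU of any type, four cores, and one hour
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# show which GPU(s) the job was allocated
srun nvidia-smi -L

With typed GRES defined in gres.conf, a request such as --gres=gpu:rtx6000:1
would pin the job to a specific GPU model, whereas a plain --gres=gpu:1 should
be satisfiable by any node with a free GPU.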