Hello Slurm Admins,

I have set up Slurm for a GPU cluster. The basic installation without
gres/gpu works well. Now I am trying to add the GPUs to the Slurm
configuration. All attempts have failed so far, and sinfo -R always
reports

gres/gpu count reported lower than configured ( 0 < 2 )

With nvidia-smi the GPUs are detected, and running jobs on them works fine.
I have tried to get rid of the above error by setting the node state to
IDLE with scontrol. That attempt also failed, with the error message

slurm_update error: Invalid node state specified
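
For completeness, the command I used was along these lines (node name as
in slurm.conf):

# attempt to set the node back to IDLE to clear the reported reason
scontrol update NodeName=hpc-node14 State=IDLE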

I ran slurmd on the GPU node at debug5 level. From slurmd.log I can see
that gres.conf is found and that gres_gpu.so / gpu_generic.so are loaded.
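
In case it matters, the debug level was raised roughly like this in
slurm.conf (followed by restarting slurmd on the node):

# raise slurmd verbosity so GRES plugin loading shows up in slurmd.log
SlurmdDebug=debug5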

My Slurm configuration is as follows:

slurm.conf:
GresTypes=gpu
NodeName=hpc-node14 CPUs=128 RealMemory=515815 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 Gres=gpu:2 State=UNKNOWN

gres.conf:
NodeName=hpc-node[01-14] Name=gpu File=/dev/nvidia[0-1]

Does anyone know what is wrong and how to fix it?
Thank you.


Best wishes
Achim

