Hi,
My cluster has 2 nodes; the first has 2 GPUs and the second has 1 GPU.
Both nodes are in the "drained" state with the reason "gres/gpu count
reported lower than configured": any idea why this happens? Thanks.
My .conf files are:
slurm.conf
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=t
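For comparison, here is a minimal sketch of how the pieces have to line up,
using hypothetical node names node1/node2 and NVIDIA device files: the Gres
count declared for a node in slurm.conf must match the number of GPUs that
node's gres.conf lets slurmd find, otherwise slurmd reports a lower count and
slurmctld drains the node with exactly that reason.

# slurm.conf (sketch; node names and counts are placeholders)
GresTypes=gpu
NodeName=node1 Gres=gpu:2 State=UNKNOWN
NodeName=node2 Gres=gpu:1 State=UNKNOWN

# gres.conf on node1 (2 GPUs)
Name=gpu File=/dev/nvidia[0-1]

# gres.conf on node2 (1 GPU)
Name=gpu File=/dev/nvidia0

# once the counts agree, clear the drain:
scontrol update NodeName=node1 State=RESUME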
After a complete shutdown and restart of all daemons, things have changed
somewhat:
# scontrol show nodes | egrep '(^Node|Gres)'
NodeName=mlscgpu1 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:10(S:0)
NodeName=mlscgpu2 Arch=x86_64 CoresPerSocket=16
Gres=gpu:quadro_rtx_6000:5(S:0)
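A quick cross-check of what each node actually detects versus what is
configured (assuming NVIDIA GPUs, which the quadro_rtx_6000 type suggests;
slurmd -G is available in recent Slurm releases):

# on the GPU node: how many devices the driver sees
nvidia-smi -L | wc -l
# what slurmd itself parses/detects from gres.conf
slurmd -G
# what slurmctld believes is configured
scontrol show node mlscgpu1 | egrep '(^Node|Gres)'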
I have two systems in my cluster with GPUs. Their setup in slurm.conf is:
GresTypes=gpu
NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1
SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1
SocketsP