Hi all, I have a few working GPU compute nodes and recently bought a couple more identical ones. They are all diskless, so they all boot from the same disk image.
For some reason slurmd refuses to start on the new nodes, and I'm not able to find any differences in hardware or software. Google searches for "error: Waiting for gres.conf file" or "fatal: can't stat gres.conf file" are not helping. The gres.conf file is present and identical on all nodes, the /dev/nvidia[0-3] device files are there, and 'nvidia-smi -L' works fine. What am I missing?

[root@n0038 ~]# slurmd -Dcvvv
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)
slurmd: Message aggregation disabled
slurmd: debug: init: Gres GPU plugin loaded
slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory

Slurm version: ohpc-17.02.7-61
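For reference, this is the shape of gres.conf I'd expect on a node like this — a sketch only, since the actual CPU ranges and device count on these nodes are assumptions on my part (note that slurmd expects the File= and CPUs= fields to be space-separated):

```
# Hypothetical gres.conf for a 2-GPU node (ranges are assumptions):
Name=gpu File=/dev/nvidia0 CPUs=0-9
Name=gpu File=/dev/nvidia1 CPUs=10-19
```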
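Since the fatal error is a failed stat() on the device path, here is the quick check I ran on the new nodes (paths taken from the error message above; this is just a sanity-check loop, not anything Slurm-specific):

```shell
# Verify each GPU device node exists as a character device and show
# its permissions/ownership; print "missing:" for any that don't.
for dev in /dev/nvidia0 /dev/nvidia1; do
  if [ -c "$dev" ]; then
    stat -c '%n %a %U:%G' "$dev"
  else
    echo "missing: $dev"
  fi
done
```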