Hi Alex,

What's the actual content of your gres.conf file? It seems to me that you have a trailing comma after the location of the nvidia device.
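Guessing from the error in your log (I haven't seen your file, so treat this as a sketch), the offending line probably looks something like:

    Name=gpu File=/dev/nvidia[0-1],CPUs="0-9"

With the comma attached, slurmd parses everything after File= as one long device path, which is exactly the "file" it then fails to stat. Parameters in gres.conf are separated by whitespace, so the line should read more like:

    Name=gpu File=/dev/nvidia[0-1] CPUs=0-9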
Our gres.conf has:

NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia1 Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia2 Cores=1,3,5,7,9,11,13,15,17,19,21,23
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia3 Cores=1,3,5,7,9,11,13,15,17,19,21,23

I think you have a comma between the File and Cores/CPUs.

Sean

On Tue, 24 Jul 2018 at 08:13, Alex Chekholko <a...@calicolabs.com> wrote:
> Hi all,
>
> I have a few working GPU compute nodes. I bought a couple more identical
> nodes. They are all diskless, so they all boot from the same disk image.
>
> For some reason slurmd refuses to start on the new nodes, and I'm not
> able to find any differences in hardware or software. Google searches for
> "error: Waiting for gres.conf file" or "fatal: can't stat gres.conf file"
> are not helping.
>
> The gres.conf file is there and identical on all nodes. The
> /dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine. What am
> I missing?
>
> [root@n0038 ~]# slurmd -Dcvvv
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug: CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
> slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)
> slurmd: Message aggregation disabled
> slurmd: debug: init: Gres GPU plugin loaded
> slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
> slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory
>
> SLURM version ohpc-17.02.7-61
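P.S. If you want to confirm the stray comma quickly, something like this should flag any File= value with a comma glued onto it (path assumed; adjust if your gres.conf lives elsewhere):

    grep -n 'File=[^[:space:]]*,' /etc/slurm/gres.conf

The Cores=/CPUs= lists legitimately contain commas, so the pattern only matches a comma attached directly to the device path.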