Thanks for the suggestion; if memory serves, I previously had to do that to get the drivers to load correctly after boot.
However, in this case both 'nvidia-smi' and 'nvidia-smi -L' run just fine and produce the expected output. One difference I do see is that my older nodes have these two "uvm" devices, which are missing on the new nodes (a rough sketch of how those device nodes can be created by hand is at the bottom of this message):

working:
[root@n0035 ~]# ls -alhtr /dev/nvidia*
crw-rw-rw- 1 root root 195, 255 Nov 6 2017 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 0 Nov 6 2017 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Nov 6 2017 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Nov 6 2017 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Nov 6 2017 /dev/nvidia3
crw-rw-rw- 1 root root 241, 1 Nov 7 2017 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 241, 0 Nov 7 2017 /dev/nvidia-uvm

not working:
[root@n0039 ~]# ls -alhtr /dev/nvidia*
crw-rw-rw- 1 root root 195, 255 Jul 12 17:09 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 0 Jul 12 17:09 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Jul 12 17:09 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Jul 12 17:09 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Jul 12 17:09 /dev/nvidia3

On Mon, Jul 23, 2018 at 3:41 PM Bill <b...@simplehpc.com> wrote:

> Hi Alex,
>
> Try running nvidia-smi before starting slurmd; I ran into this issue too. I have to
> run nvidia-smi before slurmd when I reboot the system.
>
> Regards,
> Bill
>
>
> ------------------ Original ------------------
> *From:* Alex Chekholko <a...@calicolabs.com>
> *Date:* Tue, Jul 24, 2018 6:10 AM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* Re: [slurm-users] "fatal: can't stat gres.conf"
>
> Hi all,
>
> I have a few working GPU compute nodes. I bought a couple more identical
> nodes. They are all diskless, so they all boot from the same disk image.
>
> For some reason slurmd refuses to start on the new nodes, and I'm not able
> to find any differences in hardware or software. Google searches for
> "error: Waiting for gres.conf file" or "fatal: can't stat gres.conf file"
> are not helping.
>
> The gres.conf file is there and identical on all nodes. The
> /dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine. What am I
> missing?
>
>
> [root@n0038 ~]# slurmd -Dcvvv
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug: CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
> slurmd: Node configuration differs from hardware: CPUs=16:20(hw)
> Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw)
> ThreadsPerCore=1:1(hw)
> slurmd: Message aggregation disabled
> slurmd: debug: init: Gres GPU plugin loaded
> slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
> slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No
> such file or directory
>
>
> SLURM version ohpc-17.02.7-61
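P.S. For anyone hitting the same missing-uvm symptom: the device nodes can usually be recreated by loading the nvidia-uvm module and making the node by hand, roughly along the lines of the startup-script recipe in NVIDIA's CUDA installation guide. Untested on this particular diskless image, so treat it as a sketch:

    # Sketch: load the UVM module and create /dev/nvidia-uvm manually
    /sbin/modprobe nvidia-uvm
    if [ $? -eq 0 ]; then
      # nvidia-uvm gets a dynamic major number, so look it up in /proc/devices
      D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
      mknod -m 666 /dev/nvidia-uvm c "$D" 0
    fi

If the nvidia-modprobe helper is installed, 'nvidia-modprobe -u -c=0' should accomplish the same thing, and running any CUDA program typically triggers it as well, which may be why running something like nvidia-smi before slurmd helps on some setups.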
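Since the gres.conf contents themselves aren't shown in the thread, here is a purely hypothetical example matching the File= and CPUs= values that appear in that slurmd error, just to illustrate the expected format on a 4-GPU node; the real file may differ:

    # hypothetical gres.conf, for illustration only
    Name=gpu File=/dev/nvidia[0-1] CPUs=0-9
    Name=gpu File=/dev/nvidia[2-3] CPUs=10-19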