You may want to check that your GPUs are in persistence mode. You can enable it through the nvidia-smi utility.
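Something like the following should do it (run as root; the exact invocation can vary by driver version, so treat this as a sketch rather than a definitive recipe):

# Check the current setting
nvidia-smi -q | grep -i persistence
# Enable persistence mode on all GPUs
nvidia-smi -pm 1

On newer drivers, NVIDIA recommends running the nvidia-persistenced daemon instead of the legacy -pm setting.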
Nicholas McCollum
Alabama Supercomputer Authority

________________________________
From: Alex Chekholko <a...@calicolabs.com>
Sent: Monday, July 23, 2018 6:00 PM
To: Slurm User Community List
Subject: Re: [slurm-users] "fatal: can't stat gres.conf"

Thanks for the suggestion; if memory serves, I had to do that previously to get the drivers to load correctly after boot. In this case, however, both 'nvidia-smi' and 'nvidia-smi -L' run just fine and produce the expected output. One difference I do see is that my older nodes have these two "uvm" devices:

working:

[root@n0035 ~]# ls -alhtr /dev/nvidia*
crw-rw-rw- 1 root root 195, 255 Nov  6  2017 /dev/nvidiactl
crw-rw-rw- 1 root root 195,   0 Nov  6  2017 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Nov  6  2017 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Nov  6  2017 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Nov  6  2017 /dev/nvidia3
crw-rw-rw- 1 root root 241,   1 Nov  7  2017 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 241,   0 Nov  7  2017 /dev/nvidia-uvm

not working:

[root@n0039 ~]# ls -alhtr /dev/nvidia*
crw-rw-rw- 1 root root 195, 255 Jul 12 17:09 /dev/nvidiactl
crw-rw-rw- 1 root root 195,   0 Jul 12 17:09 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jul 12 17:09 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Jul 12 17:09 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Jul 12 17:09 /dev/nvidia3

On Mon, Jul 23, 2018 at 3:41 PM Bill <b...@simplehpc.com> wrote:

Hi Alex,

Try running nvidia-smi before starting slurmd; I ran into this issue as well. I have to run nvidia-smi before slurmd whenever I reboot the system.

Regards,
Bill

------------------ Original ------------------
From: Alex Chekholko <a...@calicolabs.com>
Date: Tue, Jul 24, 2018 6:10 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] "fatal: can't stat gres.conf"

Hi all,

I have a few working GPU compute nodes and bought a couple more identical ones. They are all diskless, so they all boot from the same disk image. For some reason slurmd refuses to start on the new nodes, and I'm not able to find any differences in hardware or software. Google searches for "error: Waiting for gres.conf file" or "fatal: can't stat gres.conf file" are not helping. The gres.conf file is there and identical on all nodes. The /dev/nvidia[0-3] files are there and 'nvidia-smi -L' works fine. What am I missing?

[root@n0038 ~]# slurmd -Dcvvv
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:20 Boards:1 Sockets:2 CoresPerSocket:10 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:20(hw) Boards=1:1(hw) SocketsPerBoard=16:2(hw) CoresPerSocket=1:10(hw) ThreadsPerCore=1:1(hw)
slurmd: Message aggregation disabled
slurmd: debug:  init: Gres GPU plugin loaded
slurmd: error: Waiting for gres.conf file /dev/nvidia[0-1],CPUs="0-9"
slurmd: fatal: can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9": No such file or directory

SLURM version: ohpc-17.02.7-61
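For comparison, a gres.conf for 4-GPU, 2-socket nodes like these usually looks something like the following (a sketch only; the device ranges and CPU bindings here are guesses based on the output above, so adjust them to the actual topology):

Name=gpu File=/dev/nvidia[0-1] CPUs=0-9
Name=gpu File=/dev/nvidia[2-3] CPUs=10-19

together with a matching Gres=gpu:4 on the node definition in slurm.conf. Note that the fields on each gres.conf line are separated by whitespace, not commas; an error of the form can't stat gres.conf file /dev/nvidia[0-1],CPUs="0-9" suggests slurmd may be treating that whole string as a single File= path. Separately, the /dev/nvidia-uvm devices missing on the new nodes are typically created by loading the UVM module (for example with nvidia-modprobe -u, or as a side effect of running any CUDA program once).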