Hi,

I am trying to set up Slurm with GPUs as GRES on a 3 node
configuration (hostnames: server1, server2, server3).

For a while everything looked fine and I was able to run

srun --label --nodes=3 hostname

which is what I use to test whether Slurm is working correctly. Then it
randomly stopped working.

It turns out slurmctld is not running, and it throws the following error (the
two lines are consecutive in the log file):

root@server1:/var/log# grep -i error slurmctld.log
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use

This error appeared even though I had made no changes to the config files;
in fact, the cluster wasn't used at all for a few weeks before the error was
thrown.

This is the simple script I use to restart Slurm:

root@server1:~# cat slurmRestart.sh
#! /bin/bash
scp /etc/slurm/slurm.conf server2:/etc/slurm/ && echo copied slurm.conf to server2;
scp /etc/slurm/slurm.conf server3:/etc/slurm/ && echo copied slurm.conf to server3;
rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld ; echo restarting slurm on server1;
(ssh server2 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server2;
(ssh server3 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server3;
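
In case it matters for the ordering question below, here is a variant of the
restart I have been considering, which stops everything first and only then
starts the daemons, with slurmctld on server1 before slurmd on the nodes.
This is just a sketch and assumes slurmctld is meant to run on server1 only:

#! /bin/bash
# stop all daemons on all three machines first
systemctl stop slurmd slurmctld
ssh server2 "systemctl stop slurmd slurmctld"
ssh server3 "systemctl stop slurmd slurmctld"
# start the controller on server1, then slurmd everywhere
systemctl start slurmctld
systemctl start slurmd
ssh server2 "systemctl start slurmd"
ssh server3 "systemctl start slurmd"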

Could the error be due to the slurmd and/or slurmctld not being started in
the right order?
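
To find out what is holding the port when this error appears, my plan is to
run the following on server1 (a sketch, using the SlurmctldPort=6817 and
SlurmdPort=6818 values from my slurm.conf below):

# which process, if any, is listening on the Slurm ports?
ss -tlnp | grep -E ':(6817|6818)'
# is there a stale slurmctld that systemd does not know about?
pgrep -a slurmctld
systemctl status slurmctld slurmd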

The other question I have is regarding the configuration of a GPU as a GRES
- how do I verify that it has been configured correctly? I was told to run
srun nvidia-smi with and without requesting a GPU, but whether or not I
request one has no effect on the output of the command:

root@server1:~# srun --nodes=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
root@server1:~# srun --nodes=1 --gpus-per-node=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005

I am sceptical about whether the GPU has been configured properly; is this
the best way to check that it has?
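
In case it helps, this is what I was planning to check next. It is only a
sketch, and it assumes that Slurm exports CUDA_VISIBLE_DEVICES to job steps
that request a GPU through GRES:

# does the controller see the GPU GRES on each node?
scontrol show node server1 | grep -i gres
sinfo -N -o "%N %G"
# is a GPU actually bound to the job step?
srun --nodes=1 --gpus-per-node=1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'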

*The error:*
I first noticed this when I tried to run the command I usually use to check
that everything is fine: the srun command works with a single node, but if I
specify the number of nodes as 3 the job just queues, and the only way to
stop it is to press Ctrl+C:

root@server1:~# srun --label --nodes=1 hostname
0: server1
root@server1:~# ssh server2 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# ssh server3 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# srun --label --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 265 queued and waiting for resources
^Csrun: Job allocation 265 has been revoked
srun: Force Terminated JobId=265
root@server1:~# ssh server2 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 266 queued and waiting for resources
^C
root@server1:~# ssh server3 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 267 queued and waiting for resources
root@server1:~#
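
To see why the nodes are being reported as down, drained or reserved, I
intend to check their state and the recorded reason first; the resume step
is only something I would try after fixing whatever the Reason field points
to:

sinfo -R
scontrol show node server[1-3] | grep -E 'NodeName|State|Reason'
# only after the underlying cause is fixed:
scontrol update nodename=server[1-3] state=resume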


*The logs:*
1) The last 30 lines of */var/log/slurmctld.log* at the debug5 level on
server #1 (pastebin of the entire log: <https://pastebin.com/fw4C4xtr>):

root@server1:/var/log# tail -30 slurmctld.log
[2024-07-22T14:47:32.301] debug:  Updating partition uid access list
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/resv_state` as buf_t
[2024-07-22T14:47:32.301] debug3: Version string in resv_state header is PROTOCOL_VERSION
[2024-07-22T14:47:32.301] Recovered state of 0 reservations
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/trigger_state` as buf_t
[2024-07-22T14:47:32.301] State of 0 triggers recovered
[2024-07-22T14:47:32.301] read_slurm_conf: backup_controller not specified
[2024-07-22T14:47:32.301] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-07-22T14:47:32.301] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-22T14:47:32.301] debug:  power_save module disabled, SuspendTime < 0
[2024-07-22T14:47:32.301] Running as primary controller
[2024-07-22T14:47:32.301] debug:  No backup controllers, not launching heartbeat.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/priority_basic.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Priority BASIC plugin type:priority/basic version:0x160508
[2024-07-22T14:47:32.301] debug:  priority/basic: init: Priority BASIC plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.301] No parameter for mcs plugin, default values set
[2024-07-22T14:47:32.301] mcs: MCSParameters = (null). ondemand set.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/mcs_none.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mcs none plugin type:mcs/none version:0x160508
[2024-07-22T14:47:32.301] debug:  mcs/none: init: mcs none plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.302] debug3: _slurmctld_rpc_mgr pid = 3159324
[2024-07-22T14:47:32.302] debug3: _slurmctld_background pid = 3159324
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.304] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.304] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.304] slurmscriptd: debug:  _slurmscriptd_mainloop: finished

2) Entirety of *slurmctld.log on server #2*:

root@server2:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.614] debug:  slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.614] debug:  Log file re-opened
[2024-07-22T14:47:32.615] slurmscriptd: debug:  slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.615] slurmscriptd: debug:  _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.616] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug:  slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.616] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.616] debug:  _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.616] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.616] debug3: Called _msg_readable
[2024-07-22T14:47:32.616] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.616] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.616] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.616] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.616] debug3: Success.
[2024-07-22T14:47:32.616] error: This host (server2/server2) not a valid controller
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.617] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.617] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.617] slurmscriptd: debug:  _slurmscriptd_mainloop: finished

3) Entirety of *slurmctld.log on server #3*:

root@server3:/var/log# cat slurmctld.log
[2024-07-22T14:47:32.927] debug:  slurmctld log levels: stderr=debug5 logfile=debug5 syslog=quiet
[2024-07-22T14:47:32.927] debug:  Log file re-opened
[2024-07-22T14:47:32.928] slurmscriptd: debug:  slurmscriptd: Got ack from slurmctld, initialization successful
[2024-07-22T14:47:32.928] slurmscriptd: debug:  _slurmscriptd_mainloop: started
[2024-07-22T14:47:32.928] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.928] debug:  slurmctld: slurmscriptd fork()'d and initialized.
[2024-07-22T14:47:32.928] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.928] slurmctld version 22.05.8 started on cluster dlabcluster
[2024-07-22T14:47:32.929] debug:  _slurmctld_listener_thread: started listening to slurmscriptd
[2024-07-22T14:47:32.929] debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.929] debug3: Called _msg_readable
[2024-07-22T14:47:32.929] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
[2024-07-22T14:47:32.929] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x160508
[2024-07-22T14:47:32.929] cred/munge: init: Munge credential signature plugin loaded
[2024-07-22T14:47:32.929] debug3: Success.
[2024-07-22T14:47:32.929] error: This host (server3/server3) not a valid controller
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _handle_close
[2024-07-22T14:47:32.930] slurmscriptd: debug4: eio: handling events for 1 objects
[2024-07-22T14:47:32.930] slurmscriptd: debug3: Called _msg_readable
[2024-07-22T14:47:32.930] slurmscriptd: debug:  _slurmscriptd_mainloop: finished
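
Since server2 and server3 both log "not a valid controller", I also want to
confirm which daemons are actually running and enabled on each machine. This
is only a sketch and assumes passwordless ssh from server1 to all three
hosts, including itself:

for h in server1 server2 server3; do
    echo "== $h =="
    ssh "$h" "systemctl is-active slurmd slurmctld; systemctl is-enabled slurmd slurmctld"
done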


*The config files (shared by all 3 computers):*
1) */etc/slurm/slurm.conf* without the comments:

root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
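
To double check that the NodeName line matches the real hardware, I was
going to compare it against what slurmd itself reports on each node, since
slurmd -C prints the node's configuration in slurm.conf format:

for h in server1 server2 server3; do ssh "$h" "slurmd -C"; done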


2) */etc/slurm/gres.conf*:

root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0
NodeName=server2 Name=gpu File=/dev/nvidia0
NodeName=server3 Name=gpu File=/dev/nvidia0
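
And to make sure the File= path in gres.conf actually exists on every node:

for h in server1 server2 server3; do ssh "$h" "ls -l /dev/nvidia0"; done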

These files are the same on all 3 computers:

root@server1:/etc/slurm# diff slurm.conf <(ssh server2 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff slurm.conf <(ssh server3 "cat /etc/slurm/slurm.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server2 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm# diff gres.conf <(ssh server3 "cat /etc/slurm/gres.conf")
root@server1:/etc/slurm#


Thank you,
Shookti