[slurm-users] File-less NVIDIA GeForce 4070 Ti being removed from GRES list

2024-04-02 Thread Shooktija S N via slurm-users
Hi,

I am trying to set up Slurm (version 22.05) on a 3-node cluster in which each
node has an NVIDIA GeForce RTX 4070 Ti GPU.
I tried to follow along with the GRES setup tutorial on the SchedMD website
and added the following (Gres=gpu:RTX4070TI:1) to the node configuration in
/etc/slurm/slurm.conf:

NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1

I do not have a gres.conf.
However, I see this line at the debug log level in /var/log/slurmd.log:

[2024-04-02T15:57:19.022] debug:  Removing file-less GPU gpu:RTX4070TI from
final GRES list

What other configs are necessary for Slurm to work with my GPU?
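
In case it helps show where I am stuck: my current guess from the docs (untested,
and the device path is an assumption on my part) is that each node needs a
gres.conf along these lines:

NodeName=server[1-3] Name=gpu Type=RTX4070TI File=/dev/nvidia0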

More information:
OS: Proxmox VE 8.1.4
Kernel: 6.5.13
CPU: AMD EPYC 7662
Memory: 128636MiB

This is the /etc/slurm/slurm.conf shared by all 3 nodes, without the comment
lines:

ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] How to reinstall / reconfigure Slurm?

2024-04-03 Thread Shooktija S N via slurm-users
Hi,

I am setting up Slurm on our lab's 3 node cluster and I have run into a
problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES.
There is an error at the 'debug' log level in slurmd.log that says that the
GPU is file-less and is being removed from the final GRES list. According to
some older posts on this forum, this error might be fixed by reinstalling
or reconfiguring Slurm with the right flag (the '--with-nvml' flag according
to this post: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY).

Line in /var/log/slurmd.log:
[2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from
final GRES list

Does this error require me to reinstall or reconfigure Slurm? What exactly
does 'reconfigure Slurm' mean?
I'm about as clueless as a caveman with a smartphone when it comes to Slurm
administration and Linux system administration in general. So, if you
could, please explain it to me as simply as possible.

slurm.conf without comment lines:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf (only one line):
AutoDetect=nvml

While installing CUDA, I know that NVML was installed because of this line
in /var/log/cuda-installer.log:
[INFO]: Installing: cuda-nvml-dev
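
If it is relevant, my understanding (which may be wrong) is that slurmd can
print the GRES it detects, so I was planning to check whether NVML
autodetection actually finds the card on a node with something like:

root@server1:~# slurmd -G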

Thanks!

PS: I could've added this as a continuation to this post, but for some
reason I do not have permission to post to that group, so here I am
starting a new thread.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-04 Thread Shooktija S N via slurm-users
Thank you for the response; it certainly clears up a few things, and the
list of required packages is super helpful (where are these listed in the
docs?).

Here are a few follow up questions:

I had installed Slurm (version 22.05) using apt by running 'apt install
slurm-wlm'. Is it necessary to remove that installation first (e.g. with
'apt-get autoremove slurm-wlm') in order to compile the Slurm source code
from scratch, as you've described?

You have given this command as an example:
rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam
--with-pam=/usr" --define="_with_pmix --with-pmix=/usr"
--define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed"
--define="_with_http_parser --with-http-parser=/usr/lib64"
--define="_with_yaml  --define="_with_jwt  --define="_with_slurmrestd
--with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date
+%F` 2>&1

Are the options you've used in this example command fairly standard options
for a 'general' installation of Slurm? Where can I learn more about these
options to make sure that I don't miss any important options that might be
necessary for the specs of my cluster?

Would I have to add the paths to the compiled binaries to the PATH or
LD_LIBRARY_PATH environment variables?

My nodes are running an OS based on Debian 12 (Proxmox VE). What is the
'rpmbuild' equivalent for my OS, and would the syntax used in your example
command be the same for any build tool?
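
In case it makes my question clearer, my rough guess at a Debian-style
equivalent (the paths and the NVML location are assumptions on my part, not
something I have tested) would be to build straight from the tarball with
configure:

wget https://download.schedmd.com/slurm/slurm-23.11.5.tar.bz2
tar xjf slurm-23.11.5.tar.bz2
cd slurm-23.11.5
./configure --prefix=/usr --sysconfdir=/etc/slurm --with-nvml=/usr/local/cuda
make -j
make install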

Thanks!


On Wed, Apr 3, 2024 at 9:18 PM Williams, Jenny Avis 
wrote:

> Slurm source code should be downloaded and recompiled including the
> configuration flag – with-nvml.
>
>
>
>
>
> As an example, using the rpmbuild mechanism for recompiling and generating
> rpms, this is our current method.  Be aware that the compile works only if
> it finds the prerequisites needed for a given option on the host. (* e.g.
> to recompile this --with-nvml you should do so on a functioning gpu host *)
>
>
>
> 
>
>
>
> export VERSION=23.11.5
>
>
>
>
>
> wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2
>
> #
>
> rpmbuild --define="_with_nvml --with-nvml=/usr" --define="_with_pam
> --with-pam=/usr" --define="_with_pmix --with-pmix=/usr"
> --define="_with_hdf5 --without-hdf5" --define="_with_ofed --without-ofed"
> --define="_with_http_parser --with-http-parser=/usr/lib64"
> --define="_with_yaml  --define="_with_jwt  --define="_with_slurmrestd
> --with-slurmrestd=1" -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-`date
> +%F` 2>&1
>
>
>
>
>
> This is a list of packages we ensure are installed on a given node when
> running this compile .
>
>
>
> - pkgs:
>   - bzip2
>   - cuda-nvml-devel-12-2
>   - dbus-devel
>   - freeipmi
>   - freeipmi-devel
>   - gcc
>   - gtk2-devel
>   - hwloc-devel
>   - libjwt-devel
>   - libssh2-devel
>   - libyaml-devel
>   - lua-devel
>   - make
>   - mariadb-devel
>   - munge-devel
>   - munge-libs
>   - ncurses-devel
>   - numactl-devel
>   - openssl-devel
>   - pam-devel
>   - perl
>   - perl-ExtUtils-MakeMaker
>   - readline-devel
>   - rpm-build
>   - rpmdevtools
>   - rrdtool-devel
>   - http-parser-devel
>   - json-c-devel
>
>
>
> *From:* Shooktija S N via slurm-users 
> *Sent:* Wednesday, April 3, 2024 7:01 AM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] How to reinstall / reconfigure Slurm?
>
>
>
> Hi,
>
>
>
> I am setting up Slurm on our lab's 3 node cluster and I have run into a
> problem while adding GPUs (each node has an NVIDIA 4070 ti) as a GRES.
> There is an error at the 'debug' log level in slurmd.log that says that the
> GPU is file-less and is being removed from the final GRES list. This error
> according to some older posts on this forum might be fixed by reinstalling
> / reconfiguring Slurm with the right flag (the '--with-nvml' flag according
> to this <https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY> post).
>
>
>
> Line in /var/log/slurmd.log:
>
> [2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from
> final GRES list
>
>
>
> Does this error require me to either reinstall / reconfigure Slurm? What
> does 'reconfigure Slurm' mean?
>
> I'm about as clue

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-08 Thread Shooktija S N via slurm-users
Follow up:
I was able to fix my problem by following advice in this post, which said
that the GPU GRES could be configured manually (no autodetect) by adding a
line like this to gres.conf: 'NodeName=slurmnode Name=gpu File=/dev/nvidia0'.
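
On my nodes that meant one line in /etc/slurm/gres.conf per machine, with
the NodeName adjusted on each, for example:

NodeName=server1 Name=gpu File=/dev/nvidia0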

On Wed, Apr 3, 2024 at 4:30 PM Shooktija S N  wrote:

> Hi,
>
> I am setting up Slurm on our lab's 3 node cluster and I have run into a
> problem while adding GPUs (each node has an NVIDIA 4070 ti) as a GRES.
> There is an error at the 'debug' log level in slurmd.log that says that the
> GPU is file-less and is being removed from the final GRES list. This error
> according to some older posts on this forum might be fixed by reinstalling
> / reconfiguring Slurm with the right flag (the '--with-nvml' flag according
> to this  post).
>
> Line in /var/log/slurmd.log:
> [2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from
> final GRES list
>
> Does this error require me to either reinstall / reconfigure Slurm? What
> does 'reconfigure Slurm' mean?
> I'm about as clueless as a caveman with a smartphone when it comes to
> Slurm administration and Linux system administration in general. So, if you
> could, please explain it to me as simply as possible.
>
> slurm.conf without comment lines:
> ClusterName=DlabCluster
> SlurmctldHost=server1
> GresTypes=gpu
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=root
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/affinity,task/cgroup
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> SlurmctldDebug=debug2
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=debug2
> SlurmdLogFile=/var/log/slurmd.log
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
> PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> gres.conf (only one line):
> AutoDetect=nvml
>
> While installing cuda, I know that nvml has been installed because of this
> line in /var/log/cuda-installer.log:
> [INFO]: Installing: cuda-nvml-dev
>
> Thanks!
>
> PS: I could've added this as a continuation to this post
> , but for some
> reason I do not have permission to post to that group, so here I am
> starting a new thread.
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Reserving resources for use by non-slurm stuff

2024-04-17 Thread Shooktija S N via slurm-users
Hi, I am running Slurm (v22.05.8) on 3 nodes each with the following specs:
OS: Proxmox VE 8.1.4 x86_64 (based on Debian 12)
CPU: AMD EPYC 7662 (128)
GPU: NVIDIA GeForce RTX 4070 Ti
Memory: 128 GB

This is /etc/slurm/slurm.conf on all 3 computers without the comment lines:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

I want to reserve a few cores and a few gigabytes of RAM for exclusive use
by the OS, so that they cannot be touched by jobs managed by Slurm. What
configuration do I need to achieve this?

Is it possible to reserve, in a similar fashion, a 'percentage' of the GPU
that Slurm cannot exceed, so that the OS keeps some GPU resources?

Is it possible to have these configs be different for each of the 3 nodes?
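
To make the questions concrete, this is roughly what I imagine the node
definitions might look like if CoreSpecCount and MemSpecLimit are the right
options (the numbers are placeholders and I have not tested this; the
differing per-node values are only there to illustrate the last question):

NodeName=server1 RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 CoreSpecCount=4 MemSpecLimit=8192 State=UNKNOWN Gres=gpu:1
NodeName=server[2-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 CoreSpecCount=2 MemSpecLimit=4096 State=UNKNOWN Gres=gpu:1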

Thanks!

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] GPU GRES verification and some really broad questions.

2024-05-03 Thread Shooktija S N via slurm-users
Hi,

I am a complete slurm-admin and sys-admin noob trying to set up a 3 node
Slurm cluster. I have managed to get a minimum working example running, in
which I am able to use a GPU (NVIDIA GeForce RTX 4070 Ti) as a GRES.

This is *slurm.conf* without the comment lines:

root@server1:/etc/slurm# grep -v "#" slurm.conf
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

This is *gres.conf* (only one line); each node has been assigned its
corresponding NodeName:

root@server1:/etc/slurm# cat gres.conf
NodeName=server1 Name=gpu File=/dev/nvidia0

Those are the only config files I have.

I have a few general questions, loosely arranged in ascending order of
generality:

1) I have enabled the allocation of GPU resources as a GRES and have tested
this by running:

shookti@server1:~$ srun --nodes=3 --gpus=3 --label hostname
2: server3
0: server1
1: server2

Is this a good way to check if the configs have worked correctly? How else
can I easily check if the GPU GRES has been properly configured?
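
The only other checks I could think of (these are guesses on my part, not
something I found in the docs) were to confirm that the controller reports
the GRES on the node and that a GPU allocation actually exports it into the
job environment:

root@server1:~# scontrol show node server1 | grep -i gres
root@server1:~# srun --nodes=1 --gpus=1 env | grep CUDA_VISIBLE_DEVICES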

2) I want to reserve a few CPU cores and a few gigs of memory for use by
non-Slurm tasks. According to the documentation, I am to use CoreSpecCount
and MemSpecLimit to achieve this. The documentation for CoreSpecCount says
"the Slurm daemon slurmd may either be confined to these resources (the
default) or prevented from using these resources". How do I change this
default behaviour so that the config specifies the cores reserved for
non-Slurm work instead of specifying how many cores Slurm can use?

3) While looking up examples online on how to run Python scripts inside a
conda env, I have seen that the line 'module load conda' should be run
before running 'conda activate myEnv' in the sbatch submission script. The
command 'module' did not exist until I installed the apt package
'environment-modules', but now I see that conda is not listed as a module
that can be loaded when I check using the command 'module avail'. How do I
fix this?
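
For context, this is the kind of submission script I am trying to get
working (assuming a per-user Miniconda under ~/miniconda3 rather than a
'conda' module; the paths are assumptions on my part):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=condaTest
# activate conda without the module system (assumed install path)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myEnv
python3 someScript.py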

4) A very broad question: while managing the resources used by a program,
Slurm might happen to split the resources across multiple computers that do
not necessarily have the files required by this program to run - for
example, a Python script that requires the package 'numpy', where that
package is not installed on all of the computers. How are such things dealt
with? Is the module approach meant to fix this problem? In my previous
question, if I had a Python script that users usually run just with a
command like 'python3 someScript.py' instead of running it within a conda
environment, how should I enable Slurm to manage the resources required by
this script? Would I have to install all the packages required by this
script on all the computers in the cluster?

5) Related to the previous question: I have set up my 3 nodes so that all
the users' home directories are stored on a Ceph cluster created using the
hard drives from all 3 nodes, which essentially means that a user's home
directory is mounted at the same location on all 3 computers - making a
user's data visible to all 3 nodes. Does this make it easier to manage the
dependencies of a program as described in the previous question? I realise
that reading and writing files on the hard drives of a Ceph cluster is not
particularly fast, so I am planning on having users use the /tmp/ directory
for speed-critical reading and writing, as the OSs have been installed on
NVMe drives.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Error binding slurm stream socket: Address already in use, and GPU GRES verification

2024-07-23 Thread Shooktija S N via slurm-users
Hi,

I am trying to set up Slurm with GPUs as GRES on a 3 node
configuration (hostnames: server1, server2, server3).

For a while everything looked fine and I was able to run

srun --label --nodes=3 hostname

which is what I use to test whether Slurm is working correctly, and then it
randomly stopped working.

Turns out slurmctld is not working and it throws the following error (the
two lines are consecutive in the log file):

root@server1:/var/log# grep -i error slurmctld.log
[2024-07-22T14:47:32.302] error: Error binding slurm stream socket: Address already in use
[2024-07-22T14:47:32.302] fatal: slurm_init_msg_engine_port error Address already in use

This error is being thrown after having made no changes to the config
files, in fact the cluster wasn't used at all for a few weeks before this
error was thrown.

This is the simple script I use to restart Slurm:

root@server1:~# cat slurmRestart.sh
#! /bin/bash
scp /etc/slurm/slurm.conf server2:/etc/slurm/ && echo copied slurm.conf to server2;
scp /etc/slurm/slurm.conf server3:/etc/slurm/ && echo copied slurm.conf to server3;
rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld ; echo restarting slurm on server1;
(ssh server2 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server2;
(ssh server3 "rm /var/log/slurmd.log /var/log/slurmctld.log ; systemctl restart slurmd slurmctld") && echo restarting slurm on server3;

Could the error be due to the slurmd and/or slurmctld not being started in
the right order?
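
One check I was planning to try (just a guess at a useful diagnostic) is to
see what is already bound to the slurmctld port before restarting anything:

root@server1:~# ss -tlnp | grep 6817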

The other question I have is regarding the configuration of a GPU as a GRES
- how do I verify that it has been configured correctly? I was told to run
nvidia-smi via srun with and without requesting the GPU, but whether or not
I request the GPU has no effect on the output of the command:

root@server1:~# srun --nodes=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005
root@server1:~#
root@server1:~# srun --nodes=1 --gpus-per-node=1 nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-55f127a8-dbf4-fd12-3cad-c0d5f2dcb005

I am sceptical about whether the GPU has been configured properly - is this
the best way to check that it has?
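
Another check I considered (an assumption on my part, since plain nvidia-smi
seems to see the card whether or not the GRES is requested) is to compare
the job environment with and without asking for the GPU:

root@server1:~# srun --nodes=1 env | grep CUDA_VISIBLE_DEVICES
root@server1:~# srun --nodes=1 --gpus-per-node=1 env | grep CUDA_VISIBLE_DEVICES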

*The error:*
I first noticed this happening when I tried to run the command I usually
use to see if everything is fine: the srun command now runs on only one
node, and if I specify the number of nodes as 3 the only way to stop it is
to press Ctrl+C:

root@server1:~# srun --label --nodes=1 hostname
0: server1
root@server1:~# ssh server2 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# ssh server3 "srun --label --nodes=1 hostname"
0: server1
root@server1:~# srun --label --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 265 queued and waiting for resources
^C
srun: Job allocation 265 has been revoked
srun: Force Terminated JobId=265
root@server1:~# ssh server2 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 266 queued and waiting for resources
^C
root@server1:~# ssh server3 "srun --label --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 267 queued and waiting for resources
root@server1:~#


*The logs:*
1) The last 30 lines of */var/log/slurmctld.log* at the debug5 level on
server #1 (pastebin to the entire log):

root@server1:/var/log# tail -30 slurmctld.log
[2024-07-22T14:47:32.301] debug:  Updating partition uid access list
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/resv_state` as buf_t
[2024-07-22T14:47:32.301] debug3: Version string in resv_state header is PROTOCOL_VERSION
[2024-07-22T14:47:32.301] Recovered state of 0 reservations
[2024-07-22T14:47:32.301] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/trigger_state` as buf_t
[2024-07-22T14:47:32.301] State of 0 triggers recovered
[2024-07-22T14:47:32.301] read_slurm_conf: backup_controller not specified
[2024-07-22T14:47:32.301] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-07-22T14:47:32.301] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-22T14:47:32.301] debug:  power_save module disabled, SuspendTime < 0
[2024-07-22T14:47:32.301] Running as primary controller
[2024-07-22T14:47:32.301] debug:  No backup controllers, not launching heartbeat.
[2024-07-22T14:47:32.301] debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/priority_basic.so
[2024-07-22T14:47:32.301] debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Priority BASIC plugin type:priority/basic version:0x160508
[2024-07-22T14:47:32.301] debug:  priority/basic: init: Priority BASIC plugin loaded
[2024-07-22T14:47:32.301] debug3: Success.
[2024-07-22T14:47:32.301] No parameter for mcs plugin, default values set
[2024-07-22T14:47:32.301] mcs: MCSParameters = (null). ondemand set.
[2024-07-22T14:47:32.301] debug3:

[slurm-users] slurmd error: port already in use, resulting in slaves not being able to communicate with master slurmctld

2024-07-26 Thread Shooktija S N via slurm-users
Hi,

I'm trying to set up a Slurm (version 22.05.8) cluster consisting of 3
nodes with these hostnames and local IP addresses:
server1 - 10.36.17.152
server2 - 10.36.17.166
server3 - 10.36.17.132

I had scrambled together a minimum working example using these resources:
https://github.com/SergioMEV/slurm-for-dummies
https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b

For a while everything looked fine and I was able to run the command I
usually use to check that everything is working:

srun --label --nodes=3 hostname

This used to show the expected output: the hostnames of all 3 computers,
namely server1, server2, and server3.

However - after having made no changes to the configs - the command no
longer works if I specify the number of nodes as anything more than 1. This
behaviour is consistent on all 3 computers; the output of 'sinfo' is also
included below:

root@server1:~# srun --nodes=1 hostname
server1
root@server1:~#
root@server1:~# srun --nodes=3 hostname
srun: Required node not available (down, drained or reserved)
srun: job 312 queued and waiting for resources
^C
srun: Job allocation 312 has been revoked
srun: Force Terminated JobId=312
root@server1:~#
root@server1:~# ssh server2 "srun --nodes=1 hostname"
server1
root@server1:~#
root@server1:~# ssh server2 "srun --nodes=3 hostname"
srun: Required node not available (down, drained or reserved)
srun: job 314 queued and waiting for resources
^C
root@server1:~#
root@server1:~#
root@server1:~# sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
mainPartition*     up   infinite      2  down*  server[2-3]
mainPartition*     up   infinite      1   idle  server1
root@server1:~#

Turns out, slurmctld on the master node (hostname: server1) and slurmd on
the slave nodes (hostnames: server2 & server3) are throwing some errors,
probably related to networking.
Here are a few lines before and after the first occurrence of the error in
slurmctld.log on the master node - it's the only type of error I have
noticed in the logs (pastebin to the entire log):

root@server1:/var/log# grep -B 20 -A 5 -m1 -i "error" slurmctld.log
[2024-07-26T13:13:49.579] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-07-26T13:13:49.580] debug:  power_save module disabled, SuspendTime < 0
[2024-07-26T13:13:49.580] Running as primary controller
[2024-07-26T13:13:49.580] debug:  No backup controllers, not launching heartbeat.
[2024-07-26T13:13:49.580] debug:  priority/basic: init: Priority BASIC plugin loaded
[2024-07-26T13:13:49.580] No parameter for mcs plugin, default values set
[2024-07-26T13:13:49.580] mcs: MCSParameters = (null). ondemand set.
[2024-07-26T13:13:49.580] debug:  mcs/none: init: mcs none plugin loaded
[2024-07-26T13:13:49.580] debug2: slurmctld listening on 0.0.0.0:6817
[2024-07-26T13:13:52.662] debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
[2024-07-26T13:13:52.662] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
[2024-07-26T13:13:52.662] debug:  gres/gpu: init: loaded
[2024-07-26T13:13:52.662] debug:  validate_node_specs: node server1 registered with 0 jobs
[2024-07-26T13:13:52.662] debug2: _slurm_rpc_node_registration complete for server1 usec=229
[2024-07-26T13:13:53.586] debug:  Spawning registration agent for server[2-3] 2 hosts
[2024-07-26T13:13:53.586] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-07-26T13:13:53.586] debug:  sched: Running job scheduler for default depth.
[2024-07-26T13:13:53.586] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
[2024-07-26T13:13:53.587] debug2: Tree head got back 0 looking for 2
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.132:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
[2024-07-26T13:13:54.589] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused

The connections to 10.36.17.166:6818 and 10.36.17.132:6818 are refused.
Those are ports specified by the 'SlurmdPort' key in slurm.conf
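
What I plan to check next on the slave nodes (again, just my guess at the
right diagnostic) is whether slurmd is actually running and listening on
that port, for example:

root@server2:~# systemctl status slurmd
root@server2:~# ss -tlnp | grep 6818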

There are similar errors in the slurmd.log files on both the slave nodes as
well. In slurmd.log on server2 the error appears only at the end of the
file (pastebin to the entire log):

root@server2:/var/log# tail -5 slurmd.log
[2024-07-26T13:13:53.018] debug:  mpi/pmix_v4: init: PMIx plugin loade

[slurm-users] Re: slurmd error: port already in use, resulting in slaves not being able to communicate with master slurmctld

2024-07-30 Thread Shooktija S N via slurm-users
This solved my problem:
https://www.reddit.com/r/HPC/comments/1eb3f0g/comment/lfmed27/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

On Fri, Jul 26, 2024 at 3:37 PM Shooktija S N  wrote:

> Hi,
>
> I'm trying to set up a Slurm (version 22.05.8) cluster consisting of 3
> nodes with these hostnames and local IP addresses:
> server1 - 10.36.17.152
> server2 - 10.36.17.166
> server3 - 10.36.17.132
>
> I had scrambled together a minimum working example using these resources:
> https://github.com/SergioMEV/slurm-for-dummies
>
> https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b
>
> For a while everything looked fine and I was able to run the command I
> usually use to see if everything is fine:
>
> srun --label --nodes=3 hostname
>
> Which used to show the expected output of the hostnames of all 3
> computers, namely: server1, server2, and server3.
>
> However - after having made no changes to the configs - the command no
> longer works if I specify the number of nodes as anything more than 1, this
> behaviour is consistent on all 3 computers, the output of 'sinfo' is also
> included below:
>
> root@server1:~# srun --nodes=1 hostname
> server1
> root@server1:~#
> root@server1:~# srun --nodes=3 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 312 queued and waiting for resources
> ^C
> srun: Job allocation 312 has been revoked
> srun: Force Terminated JobId=312
> root@server1:~#
> root@server1:~# ssh server2 "srun --nodes=1 hostname"
> server1
> root@server1:~#
> root@server1:~# ssh server2 "srun --nodes=3 hostname"
> srun: Required node not available (down, drained or reserved)
> srun: job 314 queued and waiting for resources
> ^C
> root@server1:~#
> root@server1:~#
> root@server1:~# sinfo
> PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
> mainPartition*     up   infinite      2  down*  server[2-3]
> mainPartition*     up   infinite      1   idle  server1
> root@server1:~#
>
> Turns out, slurmctld on the master node (hostname: server1) and slurmd on
> the slave nodes (hostnames: server2 & server3) are throwing some errors
> probably related to networking:
> A few lines before and after the first occurence of the error in
> slurmctld.log on the master node - it's the only type of error I have
> noticed in the logs (pastebin to the entire log
> ):
>
> root@server1:/var/log# grep -B 20 -A 5 -m1 -i "error" slurmctld.log
> [2024-07-26T13:13:49.579] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> [2024-07-26T13:13:49.580] debug:  power_save module disabled, SuspendTime < 0
> [2024-07-26T13:13:49.580] Running as primary controller
> [2024-07-26T13:13:49.580] debug:  No backup controllers, not launching heartbeat.
> [2024-07-26T13:13:49.580] debug:  priority/basic: init: Priority BASIC plugin loaded
> [2024-07-26T13:13:49.580] No parameter for mcs plugin, default values set
> [2024-07-26T13:13:49.580] mcs: MCSParameters = (null). ondemand set.
> [2024-07-26T13:13:49.580] debug:  mcs/none: init: mcs none plugin loaded
> [2024-07-26T13:13:49.580] debug2: slurmctld listening on 0.0.0.0:6817
> [2024-07-26T13:13:52.662] debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
> [2024-07-26T13:13:52.662] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
> [2024-07-26T13:13:52.662] debug:  gres/gpu: init: loaded
> [2024-07-26T13:13:52.662] debug:  validate_node_specs: node server1 registered with 0 jobs
> [2024-07-26T13:13:52.662] debug2: _slurm_rpc_node_registration complete for server1 usec=229
> [2024-07-26T13:13:53.586] debug:  Spawning registration agent for server[2-3] 2 hosts
> [2024-07-26T13:13:53.586] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2024-07-26T13:13:53.586] debug:  sched: Running job scheduler for default depth.
> [2024-07-26T13:13:53.586] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
> [2024-07-26T13:13:53.587] debug2: Tree head got back 0 looking for 2
> [2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
> [2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
> [2024-07-26T13:13:53.588] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
> [2024-07-26T13:13:53.588] debug2: Error connecting slurm stream socket at 10.36.17.132:6818: Connection refused
> [2024-07-26T13:13:54.588] debug2: _slurm_connect: failed to connect to 10.36.17.166:6818: Connection refused
> [2024-07-26T13:13:54.588] debug2: Error connecting slurm stream socket at 10.36.17.166:6818: Connection refused
> [2024-07-26T13:13:54.589] debug2: _slurm_connect: failed to connect to 10.36.17.132:6818: Connection refused
>
> The connections to 10.36.17.16