[slurm-users] Re: slurmd error: port already in use, resulting in slaves not being able to communicate with master slurmctld

2024-07-30 Thread Shooktija S N via slurm-users
This solved my problem: https://www.reddit.com/r/HPC/comments/1eb3f0g/comment/lfmed27/

[slurm-users] slurmd error: port already in use, resulting in slaves not being able to communicate with master slurmctld

2024-07-26 Thread Shooktija S N via slurm-users
Hi, I'm trying to set up a Slurm (version 22.05.8) cluster consisting of 3 nodes with these hostnames and local IP addresses: server1 - 10.36.17.152, server2 - 10.36.17.166, server3 - 10.36.17.132. I had scrambled together a minimum working example using these resources: https://github.com/SergioMEV…
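
For context, a minimal sketch of the slurm.conf lines such a layout implies (hostnames and IPs taken from the post; the CPU count, partition name, and other values are illustrative placeholders, not the poster's actual configuration):

    # minimal 3-node layout; server1 runs slurmctld
    SlurmctldHost=server1(10.36.17.152)
    NodeName=server1 NodeAddr=10.36.17.152 CPUs=128 State=UNKNOWN
    NodeName=server2 NodeAddr=10.36.17.166 CPUs=128 State=UNKNOWN
    NodeName=server3 NodeAddr=10.36.17.132 CPUs=128 State=UNKNOWN
    PartitionName=debug Nodes=server[1-3] Default=YES MaxTime=INFINITE State=UP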

[slurm-users] Error binding slurm stream socket: Address already in use, and GPU GRES verification

2024-07-23 Thread Shooktija S N via slurm-users
Hi, I am trying to set up Slurm with GPUs as GRES on a 3-node configuration (hostnames: server1, server2, server3). For a while everything looked fine and I was able to run srun --label --nodes=3 hostname, which is what I use to test whether Slurm is working correctly, and then it randomly stops. …
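
"Address already in use" on a Slurm stream socket usually means another process, often a stale slurmd left over from an earlier start, is still listening on the daemon's port. A quick diagnostic sketch, assuming the default ports (6817 for slurmctld, 6818 for slurmd):

    # show which process currently holds the default Slurm ports
    ss -tlnp | grep -E ':(6817|6818)'
    # if a stale slurmd holds the port, restart it cleanly
    systemctl restart slurmd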

[slurm-users] GPU GRES verification and some really broad questions.

2024-05-03 Thread Shooktija S N via slurm-users
Hi, I am a complete slurm-admin and sys-admin noob trying to set up a 3-node Slurm cluster. I have managed to get a minimum working example running, in which I am able to use a GPU (NVIDIA GeForce RTX 4070 Ti) as a GRES. This is *slurm.conf* without the comment lines: root@server1:/etc/slurm# …
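
The post prints slurm.conf with the comment lines stripped; a generic shell one-liner that produces that view (not necessarily the exact command used):

    # print slurm.conf without comment or blank lines
    grep -Ev '^[[:space:]]*(#|$)' /etc/slurm/slurm.conf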

[slurm-users] Reserving resources for use by non-slurm stuff

2024-04-17 Thread Shooktija S N via slurm-users
Hi, I am running Slurm (v22.05.8) on 3 nodes, each with the following specs: OS: Proxmox VE 8.1.4 x86_64 (based on Debian 12); CPU: AMD EPYC 7662 (128); GPU: NVIDIA GeForce RTX 4070 Ti; Memory: 128 GB. This is /etc/slurm/slurm.conf on all 3 computers without the comment lines: ClusterName=DlabCluster …
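
Slurm's usual mechanism for holding resources back from jobs is specialized cores and memory in the node definition. A sketch with illustrative numbers, not values from the post:

    # reserve 8 cores and 8 GiB per node for non-Slurm work
    NodeName=server[1-3] CPUs=128 RealMemory=128000 CoreSpecCount=8 MemSpecLimit=8192 State=UNKNOWN

Note that MemSpecLimit is given in megabytes, and enforcing it requires cgroup-based task containment.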

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-08 Thread Shooktija S N via slurm-users
Follow-up: I was able to fix my problem by following the advice in this post, which said that the GPU GRES could be manually configured (no autodetect) by adding a line like this: 'NodeName=…
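
The manual (no-autodetect) configuration pairs a Gres= entry in slurm.conf with an explicit device file in gres.conf. A sketch of the shape those lines take; the type string and device path are assumptions, not quoted from the post:

    # /etc/slurm/slurm.conf -- node definition (other attributes omitted)
    NodeName=server1 Gres=gpu:RTX4070TI:1

    # /etc/slurm/gres.conf (no AutoDetect line)
    NodeName=server1 Name=gpu Type=RTX4070TI File=/dev/nvidia0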

[slurm-users] Re: How to reinstall / reconfigure Slurm?

2024-04-04 Thread Shooktija S N via slurm-users
- libyaml-devel
- lua-devel
- make
- mariadb-devel
- munge-devel
- munge-libs
- ncurses-devel
- numactl-devel
- openssl-devel
- pam-devel
- per…
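
On an RPM-based system the quoted build dependencies would typically be installed in one step; a sketch covering only the packages visible in the truncated quote:

    dnf install -y libyaml-devel lua-devel make mariadb-devel munge-devel \
        munge-libs ncurses-devel numactl-devel openssl-devel pam-devel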

[slurm-users] How to reinstall / reconfigure Slurm?

2024-04-03 Thread Shooktija S N via slurm-users
Hi, I am setting up Slurm on our lab's 3-node cluster and I have run into a problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES. There is an error at the 'debug' log level in slurmd.log saying that the GPU is file-less and is being removed from the final GRES list. This error …
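
A "file-less" GPU generally means slurmd could not tie the configured GRES to a device file. A quick sanity check on each node, assuming NVIDIA's standard device nodes:

    # the device files slurmd expects the gpu GRES to map to
    ls -l /dev/nvidia*
    # if they are missing, querying the driver usually (re)creates them
    nvidia-smi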

[slurm-users] File-less NVIDIA GeForce 4070 Ti being removed from GRES list

2024-04-02 Thread Shooktija S N via slurm-users
Hi, I am trying to set up Slurm (version 22.05) on a 3-node cluster, each node having an NVIDIA GeForce RTX 4070 Ti GPU. I tried to follow along with the GRES setup tutorial on the SchedMD website and added the following (Gres=gpu:RTX4070TI:1) to the Node configuration in /etc/slurm/slurm.conf: NodeName=…
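
Once the Gres=gpu:RTX4070TI:1 line is in place (with a matching gres.conf entry), two standard checks confirm whether the controller actually registered the GRES:

    # does slurmctld see the GPU on the node?
    scontrol show node server1 | grep -i gres
    # can a job be scheduled onto it?
    srun --gres=gpu:1 nvidia-smi -L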