@Rodrigo Santibáñez Please see the updated setup related to my question. I have compiled the latest tag `slurm-21-08-4-1` using `./configure --enable-debug --enable-front-end` in order to test in non-emulator mode:
```
git checkout slurm-21-08-4-1
./configure --enable-debug --enable-multiple-slurmd
make
sudo make install
```

I have the following lines in my `/usr/local/etc/slurm.conf` file. For:

```
NodeName=home NodeAddr=127.0.0.1 CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

I get the following error:

`slurmd: fatal: Frontend not configured correctly in slurm.conf. See FrontEndName in slurm.conf man page.`

When I try:

```
FrontEndName=home
```

I get the following error message:

```
$ sudo slurmd -Dvvv
slurmd: debug: Log file re-opened
slurmd: error: _find_node_record: lookup failure for node "home"
slurmd: error: _find_node_record: lookup failure for node "home", alias "home"
slurmd: error: slurmd initialization failed
```

Then I tried the following:

```
FrontEndName=127.0.0.1
NodeName=home NodeHostName=localhost NodeAddr=127.0.0.1 CPUs=4
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

which keeps all the submitted jobs in a pending state, with the following error message:

`slurmctld: error: _slurm_rpc_node_registration node=home: Invalid node name specified`
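
For reference, my reading of the multi_slurmd FAQ (linked at the bottom of this thread) is that with `--enable-multiple-slurmd` each emulated node gets its own slurmd process, told via `-N` which `NodeName` to register as. A minimal sketch of how I understand the daemons should be launched for the `home[1-4]` layout from my earlier mail below; I have not verified this on 21.08:

```bash
# one slurmd process per emulated node; -N tells each daemon
# which NodeName from slurm.conf it should register as
# (requires slurm built with --enable-multiple-slurmd)
sudo /usr/local/sbin/slurmd -N home1
sudo /usr/local/sbin/slurmd -N home2
sudo /usr/local/sbin/slurmd -N home3
sudo /usr/local/sbin/slurmd -N home4
```

The `%n` in my `SlurmdLogFile` and `SlurmdPidFile` settings (config below) should then give each instance its own log and pid file.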
On Sun, Nov 14, 2021 at 3:09 PM Alper Alimoglu <alper.alimo...@gmail.com> wrote:

> @Rodrigo Santibáñez I think I was not able to clarify my question.
>
> I am able to successfully run `slurm` versions lower than 20, such as `19-05-8-1`, but with the same configuration slurm version 20 or higher does not work properly. So I am lost trying to figure out the correct configuration structure for the latest stable slurm version.
>
> On Sun, Nov 14, 2021 at 2:11 AM Rodrigo Santibáñez <rsantibanez.uch...@gmail.com> wrote:
>
>> Hi Alper,
>>
>> Maybe this is relevant to you:
>>
>> *Can Slurm emulate nodes with more resources than physically exist on the node?*
>> Yes. In the slurm.conf file, configure *SlurmdParameters=config_overrides* and specify any desired node resource specifications (*CPUs*, *Sockets*, *CoresPerSocket*, *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the resource specification for each node given in *slurm.conf* and will not check these specifications against those actually found on the node. The system would best be configured with *TaskPlugin=task/none*, so that launched tasks can run on any available CPU under operating system control.
>>
>> Best
>>
>> On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimo...@gmail.com> wrote:
>>
>>> My goal is to set up a single-server `slurm` cluster (using only one computer) that can run multiple jobs in parallel.
>>>
>>> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel if each uses a single core. To do this I run the controller and the worker daemon on the same node. When I submit four jobs at the same time, only one of them runs; the other three cannot run and report `queued and waiting for resources`.
>>>
>>> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach works on tag versions `<=19`:
>>>
>>> ```
>>> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
>>> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
>>> $ sudo make && sudo make install
>>> ```
>>>
>>> but it does not work on higher versions like `slurm 20.02.1` or its `master` branch.
>>>
>>> ------
>>>
>>> ```
>>> ❯ sinfo
>>> Sat Nov 06 14:17:04 2021
>>> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> home1        1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home2        1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home3        1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> home4        1    debug*  idle    1 1:1:1      1        0      1   (null) none
>>> $ srun -N1 sleep 10  # runs
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> $ srun -N1 sleep 10  # queued and waiting for resources
>>> ```
>>>
>>> Here I get lost, since in [emulate mode][1] they should be able to run in parallel.
>>>
>>> The way I build from the source code:
>>>
>>> ```bash
>>> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>>> ./configure --enable-debug --enable-multiple-slurmd
>>> make
>>> sudo make install
>>> ```
>>>
>>> --------
>>>
>>> ```
>>> $ hostname -s
>>> home
>>> $ nproc
>>> 4
>>> ```
>>>
>>> ##### Compute node setup:
>>>
>>> ```
>>> NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>>> ```
>>>
>>> I have also tried `NodeHostName=localhost`.
>>>
>>> `slurm.conf` file:
>>>
>>> ```bash
>>> ControlMachine=home  # $(hostname -s)
>>> ControlAddr=127.0.0.1
>>> ClusterName=cluster
>>> SlurmUser=alper
>>> MailProg=/home/user/slurm_mail_prog.sh
>>> MinJobAge=172800  # 48 h
>>> SlurmdSpoolDir=/var/spool/slurmd
>>> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
>>> SlurmdPidFile=/var/run/slurmd.%n.pid
>>> AuthType=auth/munge
>>> CryptoType=crypto/munge
>>> MpiDefault=none
>>> ProctrackType=proctrack/pgid
>>> ReturnToService=1
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPort=6820
>>> SlurmctldPort=6821
>>> StateSaveLocation=/tmp/slurmstate
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> InactiveLimit=0
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>> SelectType=select/linear
>>> PriorityDecayHalfLife=0
>>> PriorityUsageResetPeriod=NONE
>>> AccountingStorageEnforce=limits
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStoreFlags=YES
>>> JobCompType=jobcomp/none
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/none
>>> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
>>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>>> ```
>>>
>>> `slurmdbd.conf`:
>>>
>>> ```bash
>>> AuthType=auth/munge
>>> AuthInfo=/var/run/munge/munge.socket.2
>>> DbdAddr=localhost
>>> DbdHost=localhost
>>> SlurmUser=alper
>>> DebugLevel=4
>>> LogFile=/var/log/slurm/slurmdbd.log
>>> PidFile=/var/run/slurmdbd.pid
>>> StorageType=accounting_storage/mysql
>>> StorageUser=alper
>>> StoragePass=12345
>>> ```
>>>
>>> The way I run slurm:
>>>
>>> ```
>>> sudo /usr/local/sbin/slurmd
>>> sudo /usr/local/sbin/slurmdbd &
>>> sudo /usr/local/sbin/slurmctld -cDvvvvvv
>>> ```
>>>
>>> ---------
>>>
>>> Related:
>>> - minimum number of computers for a slurm cluster (https://stackoverflow.com/a/27788311/2402577)
>>> - [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
>>> - https://stackoverflow.com/a/47009930/2402577
>>>
>>> [1]: https://slurm.schedmd.com/faq.html#multi_slurmd
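
PS: regarding the *config_overrides* suggestion quoted above, this is the stanza I plan to try next; a sketch based only on the FAQ text, not yet tested on my setup:

```bash
# slurm.conf sketch: take node resources (CPUs etc.) from slurm.conf
# instead of what slurmd detects on the physical machine
SlurmdParameters=config_overrides
TaskPlugin=task/none
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```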