@Rodrigo Santibáñez I think I was not able to make my question clear. I can successfully run `slurm` versions lower than 20, such as `19-05-8-1`, but with the same configuration, version 20 or higher does not work properly. So I am lost trying to figure out the correct configuration structure for the latest stable slurm version.
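If I read the FAQ entry quoted below correctly, on 20+ the declared node sizes are supposed to be forced with `SlurmdParameters=config_overrides` together with `TaskPlugin=task/none`. A minimal sketch of what I think that would look like next to my node definitions (the `home[1-4]` names, address and port are simply the ones from my configuration further down; I have not verified that this alone is sufficient):

```bash
# slurm.conf sketch (unverified): let Slurm trust the declared node resources
SlurmdParameters=config_overrides   # from the FAQ entry quoted below
TaskPlugin=task/none                # launched tasks may run on any available CPU
NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
```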
On Sun, Nov 14, 2021 at 2:11 AM Rodrigo Santibáñez <rsantibanez.uch...@gmail.com> wrote:

> Hi Alper,
>
> Maybe this is relevant to you:
>
> *Can Slurm emulate nodes with more resources than physically exist on the node?*
> Yes. In the slurm.conf file, configure *SlurmdParameters=config_overrides* and specify
> any desired node resource specifications (*CPUs*, *Sockets*, *CoresPerSocket*,
> *ThreadsPerCore*, and/or *TmpDisk*). Slurm will use the resource specification for each
> node that is given in *slurm.conf* and will not check these specifications against those
> actually found on the node. The system would best be configured with
> *TaskPlugin=task/none*, so that launched tasks can run on any available CPU under
> operating system control.
>
> Best
>
> On Sat, Nov 13, 2021 at 4:10 AM Alper Alimoglu <alper.alimo...@gmail.com> wrote:
>
>> My goal is to set up a single-server `slurm` cluster (using only a single computer)
>> that can run multiple jobs in parallel.
>>
>> On my node `nproc` returns 4, so I believe I can run 4 jobs in parallel if each uses a
>> single core. To do this I run the controller and the worker daemon on the same node.
>> When I submit four jobs at the same time, only one of them runs and the other three
>> stay pending with the message: `queued and waiting for resources`.
>>
>> I am using `Ubuntu 20.04.3 LTS`. I have observed that this approach works on tag
>> versions `<=19`:
>>
>> ```
>> $ git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>> $ git checkout e2e21cb571ce88a6dd52989ec6fe30da8c4ef15f  # slurm-19-05-8-1
>> $ ./configure --enable-debug --enable-front-end --enable-multiple-slurmd
>> $ sudo make && sudo make install
>> ```
>>
>> but it does not work on higher versions such as `slurm 20.02.1` or its `master` branch.
>>
>> ------
>>
>> ```
>> ❯ sinfo
>> Sat Nov 06 14:17:04 2021
>> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> home1        1 debug*    idle     1 1:1:1      1        0      1 (null)   none
>> home2        1 debug*    idle     1 1:1:1      1        0      1 (null)   none
>> home3        1 debug*    idle     1 1:1:1      1        0      1 (null)   none
>> home4        1 debug*    idle     1 1:1:1      1        0      1 (null)   none
>> $ srun -N1 sleep 10  # runs
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> $ srun -N1 sleep 10  # queued and waiting for resources
>> ```
>>
>> Here I get lost, since in [emulate mode][1] the jobs should be able to run in parallel.
>>
>> The way I build from the source code:
>>
>> ```bash
>> git clone https://github.com/SchedMD/slurm ~/slurm && cd ~/slurm
>> ./configure --enable-debug --enable-multiple-slurmd
>> make
>> sudo make install
>> ```
>>
>> --------
>>
>> ```
>> $ hostname -s
>> home
>> $ nproc
>> 4
>> ```
>>
>> ##### Compute node setup:
>>
>> ```
>> NodeName=home[1-4] NodeHostName=home NodeAddr=127.0.0.1 CPUs=1 ThreadsPerCore=1 Port=17001
>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>> ```
>>
>> I have also tried: `NodeHostName=localhost`
>>
>> `slurm.conf` file:
>>
>> ```bash
>> ControlMachine=home  # $(hostname -s)
>> ControlAddr=127.0.0.1
>> ClusterName=cluster
>> SlurmUser=alper
>> MailProg=/home/user/slurm_mail_prog.sh
>> MinJobAge=172800  # 48 h
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmdLogFile=/var/log/slurm/slurmd.%n.log
>> SlurmdPidFile=/var/run/slurmd.%n.pid
>> AuthType=auth/munge
>> CryptoType=crypto/munge
>> MpiDefault=none
>> ProctrackType=proctrack/pgid
>> ReturnToService=1
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPort=6820
>> SlurmctldPort=6821
>> StateSaveLocation=/tmp/slurmstate
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> InactiveLimit=0
>> Waittime=0
>> SchedulerType=sched/backfill
>> SelectType=select/linear
>> PriorityDecayHalfLife=0
>> PriorityUsageResetPeriod=NONE
>> AccountingStorageEnforce=limits
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStoreFlags=YES
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> NodeName=home[1-2] NodeHostName=home NodeAddr=127.0.0.1 CPUs=2 ThreadsPerCore=1 Port=17001
>> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP Shared=FORCE:1
>> ```
>>
>> `slurmdbd.conf`:
>>
>> ```bash
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> DbdAddr=localhost
>> DbdHost=localhost
>> SlurmUser=alper
>> DebugLevel=4
>> LogFile=/var/log/slurm/slurmdbd.log
>> PidFile=/var/run/slurmdbd.pid
>> StorageType=accounting_storage/mysql
>> StorageUser=alper
>> StoragePass=12345
>> ```
>>
>> The way I run slurm:
>>
>> ```
>> sudo /usr/local/sbin/slurmd
>> sudo /usr/local/sbin/slurmdbd &
>> sudo /usr/local/sbin/slurmctld -cDvvvvvv
>> ```
>>
>> ---------
>>
>> Related:
>> - minimum number of computers for a slurm cluster (https://stackoverflow.com/a/27788311/2402577)
>> - [Running multiple worker daemons SLURM](https://stackoverflow.com/a/40707189/2402577)
>> - https://stackoverflow.com/a/47009930/2402577
>>
>> [1]: https://slurm.schedmd.com/faq.html#multi_slurmd
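P.S. Re-reading the multi-slurmd FAQ linked above, one thing I am not sure I am doing right on the newer versions is how the daemons are started. My run commands above launch a single `slurmd` with no node name, but as far as I understand that FAQ, a build with `--enable-multiple-slurmd` (and without `--enable-front-end`) expects one `slurmd -N <nodename>` instance per emulated node, each with its own port in `slurm.conf`. A rough sketch of what I mean, using my `home[1-4]` names; I have not confirmed that this is the whole fix:

```bash
# Sketch: start one slurmd per emulated node name (multi_slurmd FAQ).
# Assumes each NodeName in slurm.conf has its own Port (e.g. 17001-17004);
# my current config reuses 17001 for all of them, which may also need changing.
for n in home1 home2 home3 home4; do
    sudo /usr/local/sbin/slurmd -N "$n"   # each instance registers as node $n
done
```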