Hello, this is my node configuration:
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist State=UP

And this is one of the job scripts. You can see mem is set to 1M, so it is very minimal:

#!/bin/bash
#SBATCH -J Test1
#SBATCH --nodelist=slurm-gpu-1
#SBATCH --mem=1M
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -o /home/centos/Test1-%j.out
#SBATCH -e /home/centos/Test1-%j.err

srun sleep 60

Thanks,
Durai

On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins <jscogg...@lbl.gov> wrote:

> Is the OverSubscribe variable set for your partitions? By default
> OverSubscribe=NO, which means that none of your cores will be shared with
> other jobs. With OverSubscribe set to YES or FORCE, you can set a number
> after FORCE to control how many jobs can run on each core of each node in
> the partition.
> Look at this page for a better understanding:
> https://slurm.schedmd.com/cons_res_share.html
>
> You can also check the OverSubscribe setting of a partition using the
> sinfo -o "%h" option:
>
> sinfo -o '%P %.5a %.10h %N' | head
>
> PARTITION AVAIL OVERSUBSCR NODELIST
>
> Look at the sinfo options for further details.
>
> Jackie
>
> On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan <arasan.du...@gmail.com>
> wrote:
>
>> Hello,
>>
>> On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
>>
>> Under these conditions, multiple jobs should be able to run on the same
>> node. But they refuse to be allocated on the same node: only one job runs
>> on the node and the rest of the jobs stay in the pending state.
>>
>> When we changed SelectTypeParameters to "CR_Core", however, this issue
>> was resolved: multiple jobs were successfully allocated to the same node
>> and ran there concurrently.
>>
>> Does anyone know why such behavior is seen? Why does including memory as
>> a consumable resource lead to node-exclusive behavior?
>>
>> Thanks,
>> Durai
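
To make Jackie's suggestion concrete: the two PartitionName lines posted above could carry an explicit OverSubscribe setting. This is a minimal sketch against the posted configuration; the FORCE:4 value (up to four jobs per core) is an illustrative choice, not something prescribed in this thread:

# Hypothetical slurm.conf partition lines with oversubscription enabled.
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users OverSubscribe=FORCE:4 State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist OverSubscribe=FORCE:4 State=UP

After reloading the configuration (for example with "scontrol reconfigure"), the new value should show up in the OVERSUBSCR column of the sinfo command Jackie quoted.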
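
On the CR_Core_Memory question, one diagnostic sketch worth trying, assuming nothing beyond the configuration posted above: the NodeName lines define no RealMemory, and RealMemory defaults to 1 (MB) in slurm.conf, so under CR_Core_Memory even a single --mem=1M job could consume all of the memory Slurm tracks for the node, which would look exactly like node-exclusive behavior. The commands below only inspect state and change nothing:

# How much memory Slurm believes the node has, and how much is allocated.
scontrol show node slurm-gpu-1 | grep -E 'RealMemory|AllocMem|FreeMem'

# %m prints the configured memory per node in MB for each partition.
sinfo -o '%P %N %m'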