Hello Durai,
you did not specify the amount of memory in your node configuration.
Perhaps it defaults to 1 MB, and so your 1 MB job already uses all the
memory that the scheduler thinks the node has...?
What does "scontrol show node slurm-gpu-1" say? Look for the
"RealMemory" field in the output.
Best,
Christoph
On 26/08/2020 11.35, Durai Arasan wrote:
Hello,
this is my node configuration:
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist State=UP
and this is one of the job scripts. You can see --mem is set to 1M, so
it is very minimal.
#!/bin/bash
#SBATCH -J Test1
#SBATCH --nodelist=slurm-gpu-1
#SBATCH --mem=1M
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -o /home/centos/Test1-%j.out
#SBATCH -e /home/centos/Test1-%j.err
srun sleep 60
Thanks,
Durai
On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins <jscogg...@lbl.gov> wrote:
What is the OverSubscribe variable set to for your partitions?
By default OverSubscribe=NO, which means that none of your cores will
be shared with other jobs. With OverSubscribe set to YES or FORCE,
you can append a number after FORCE to specify how many jobs
may run on each core of each node in the partition.
Look at this page for a better understanding:
https://slurm.schedmd.com/cons_res_share.html
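For example, a partition definition with oversubscription enabled could look roughly like this (OverSubscribe=FORCE:4, i.e. up to 4 jobs per core, is just an illustrative value, not something taken from your config):

PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE OverSubscribe=FORCE:4 State=UP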
You can also check the OverSubscribe setting of a partition with the
sinfo "%h" output format option:
sinfo -o '%P %.5a %.10h %N' | head
PARTITION AVAIL OVERSUBSCR NODELIST
Look at the sinfo options for further details.
Jackie
On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan <arasan.du...@gmail.com> wrote:
Hello,
On our cluster we have SelectTypeParameters set to "CR_Core_Memory".
Under these conditions multiple jobs should be able to run on
the same node, but they refuse to be allocated together: only one
job runs on the node and the rest stay in the pending state.
When we changed SelectTypeParameters to "CR_Core", however, the
issue was resolved and multiple jobs were successfully allocated
to the same node and ran there concurrently.
Does anyone know why this behavior occurs? Why does including
memory as a consumable resource lead to node-exclusive behavior?
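For illustration, the two setups we compared differ only in the SelectTypeParameters line of slurm.conf (the SelectType value is shown here just as the usual companion setting, not quoted verbatim from our file):

# setup that shows the problem:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# setup that behaves as expected:
SelectTypeParameters=CR_Core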
Thanks,
Durai
--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499