*-O*, *--overcommit*
Overcommit resources. When applied to job allocation, only one CPU
is allocated to the job per node and options used to specify the
number of tasks per node, socket, core, etc. are ignored. When
applied to job step allocations (the *srun* command when executed
within an existing job allocation), this option can be used to
launch more than one task per CPU. Normally, *srun* will not
allocate more than one process per CPU. By specifying *--overcommit*
you are explicitly allowing more than one process per CPU. However,
no more than *MAX_TASKS_PER_NODE* tasks are permitted to execute per
node. NOTE: *MAX_TASKS_PER_NODE* is defined in the file *slurm.h* and
is not a variable; it is set at Slurm build time.
I have used this successfully to run more jobs than there are CPUs/cores
available, e.g.:
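Something along these lines (a sketch based on the command below; the task
count of 768 is illustrative, it just needs to exceed the 128 CPUs per node
while staying under MAX_TASKS_PER_NODE, and I've dropped --exclusive):

srun --overcommit --nodes 3 --ntasks 768 /ddos/demo/showproc.sh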
-e.
Karl Lovink wrote:
Hello,
I am in the process of setting up our SLURM environment. We want to use
SLURM during our DDoS exercises for dispatching DDoS attack scripts. We
need a lot of jobs running in parallel on a total of 3 nodes. I can't get
it to run more than 128 jobs simultaneously. There are 128 CPUs in the
compute nodes.
How can I ensure that I can run more jobs in parallel than there are
CPUs in the compute nodes?
Thanks
Karl
My srun script is:
srun --exclusive --nodes 3 --ntasks 384 /ddos/demo/showproc.sh
And my slurm.conf file:
ClusterName=ddos-cluster
ControlMachine=slurm
SlurmUser=ddos
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/opt/slurm/spool/ctld
SlurmdSpoolDir=/opt/slurm/spool/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/opt/slurm/run/.pid
SlurmdPidFile=/opt/slurm/run/slurmd.pid
ProctrackType=proctrack/pgid
PluginDir=/opt/slurm/lib/slurm
ReturnToService=2
TaskPlugin=task/none
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
SlurmctldDebug=3
SlurmctldLogFile=/opt/slurm/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/opt/slurm/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/none
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStorageUser=slurm
SlurmctldParameters=enable_configless
GresTypes=gpu
DefMemPerNode=256000
NodeName=aivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
NodeName=mivd CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
NodeName=fiod CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=261562 State=UNKNOWN
PartitionName=ddos Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=adhoc Nodes=ALL Default=YES MaxTime=INFINITE State=UP