Thank you, Paul. I'll try this workaround.

Best,

Jianwen

> On Sep 16, 2020, at 9:31 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote:
> 
> This is a feature of suspend.  When Slurm suspends a job, it does not keep the 
> cpus used by that job reserved: it pauses the job's processes and keeps its 
> memory reserved, but releases the cpus back to the scheduler.
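> One way to observe this, as a quick sketch (the job ID 12345 and node cas001 
> below are placeholders, not taken from this thread):
> 
>     # Suspend a running job: its memory stays reserved, but its cpus are
>     # released back to the scheduler.
>     scontrol suspend 12345
> 
>     # Check how many cpus the node still reports as allocated.
>     scontrol show node cas001 | grep -i cpualloc
> 
>     # Resume the job later.
>     scontrol resume 12345
> 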
> 
> If you want to pause jobs without causing this contention, you need to use 
> scancel with:
> 
> -s, --signal=signal_name
> The name or number of the signal to send. If this option is not used, the 
> specified job or step will be terminated. Note: if this option is used, the 
> signal is sent directly to the slurmd where the job is running, bypassing 
> slurmctld, so the job state will not change even if the signal is delivered 
> to it. Use the scontrol command if you want the job state change to be known 
> to slurmctld.
> and issue SIGSTOP or SIGCONT.
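> 
> For example, a rough sketch of that workaround (the job ID 12345 is a 
> placeholder):
> 
>     # Pause the job's processes. slurmctld still sees the job as RUNNING, so
>     # its cpus stay allocated and no new work is scheduled onto them.
>     scancel -s STOP 12345
> 
>     # Later, let the processes continue.
>     scancel -s CONT 12345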
> 
> Frankly I wish suspend didn't work like this.  It should suspend the job 
> without releasing its cpus, keeping them reserved.  That's the natural 
> understanding of suspend, but that's not the way suspend actually works 
> in Slurm.
> 
> -Paul Edmon-
> 
> On 9/16/2020 6:08 AM, SJTU wrote:
>> Hi,
>> 
>> I am using SLURM 19.05 and found that SLURM may launch new jobs onto nodes 
>> that already hold suspended jobs, which leads to resource contention once the 
>> suspended jobs are resumed. Steps to reproduce this issue are:
>> 
>> 1. Launch 40 one-core jobs on a 40-core compute node. 
>> 2. Suspend all 40 jobs on that compute node with `scontrol suspend JOBID`.
>> 
>> Expected results: No more jobs should be launched onto the compute node, 
>> since there are already 40 suspended jobs on it.
>> 
>> Actual results: SLURM launches new jobs on that compute node, which may lead 
>> to resource contention if the previously suspended jobs are then resumed via 
>> `scontrol resume`.
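>> 
>> For reference, a sketch of the reproduction commands (node cas001 and the 
>> sleep workload are illustrative placeholders, not the exact jobs used):
>> 
>>     # Step 1: fill a 40-core node with 40 one-core jobs.
>>     for i in $(seq 40); do
>>         sbatch -w cas001 -n 1 --wrap="sleep 3600"
>>     done
>> 
>>     # Step 2: suspend each of those jobs.
>>     squeue -h -w cas001 -o %i | xargs -n 1 scontrol suspend
>> 
>>     # New jobs submitted afterwards may be scheduled onto cas001, and
>>     # resuming the suspended jobs with `scontrol resume` then
>>     # oversubscribes the node's cpus.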
>>  
>> Any suggestions are appreciated. Part of slurm.conf is attached.
>> 
>> Thank you!
>> 
>> 
>> Jianwen
>> 
>> 
>> 
>> 
>> AccountingStorageEnforce = associations,limits,qos,safe
>> AccountingStorageType = accounting_storage/slurmdbd
>> AuthType = auth/munge
>> BackupController = slurm2
>> CacheGroups = 0
>> ClusterName = mycluster
>> ControlMachine = slurm1
>> EnforcePartLimits = true
>> Epilog = /etc/slurm/slurm.epilog
>> FastSchedule = 1
>> GresTypes = gpu
>> HealthCheckInterval = 300
>> HealthCheckProgram = /usr/sbin/nhc
>> InactiveLimit = 0
>> JobAcctGatherFrequency = 30
>> JobAcctGatherType = jobacct_gather/cgroup
>> JobCompType = jobcomp/none
>> JobRequeue = 0
>> JobSubmitPlugins = lua
>> KillOnBadExit = 1
>> KillWait = 30
>> MailProg = /opt/slurm-mail/bin/slurm-spool-mail.py
>> MaxArraySize = 8196
>> MaxJobCount = 100000
>> MessageTimeout = 30
>> MinJobAge = 300
>> MpiDefault = none
>> PriorityDecayHalfLife = 31-0
>> PriorityFavorSmall = false
>> PriorityFlags = ACCRUE_ALWAYS,FAIR_TREE
>> PriorityMaxAge = 7-0
>> PriorityType = priority/multifactor
>> PriorityWeightAge = 10000
>> PriorityWeightFairshare = 10000
>> PriorityWeightJobSize = 40000
>> PriorityWeightPartition = 10000
>> PriorityWeightQOS = 0
>> PrivateData = accounts,jobs,usage,users,reservations
>> ProctrackType = proctrack/cgroup
>> Prolog = /etc/slurm/slurm.prolog
>> PrologFlags = contain
>> PropagateResourceLimitsExcept = MEMLOCK
>> RebootProgram = /usr/sbin/reboot
>> ResumeTimeout = 600
>> ResvOverRun = UNLIMITED
>> ReturnToService = 1
>> SchedulerType = sched/backfill
>> SelectType = select/cons_res
>> SelectTypeParameters = CR_CPU
>> SlurmUser = root
>> SlurmctldDebug = info
>> SlurmctldLogFile = /var/log/slurmctld.log
>> SlurmctldPidFile = /var/run/slurmctld.pid
>> SlurmctldPort = 6817
>> SlurmctldTimeout = 120
>> SlurmdDebug = info
>> SlurmdLogFile = /var/log/slurmd.log
>> SlurmdPidFile = /var/run/slurmd.pid
>> SlurmdPort = 6818
>> SlurmdSpoolDir = /tmp/slurmd
>> SlurmdTimeout = 300
>> SrunPortRange = 60001-63000
>> StateSaveLocation = /etc/slurm/state
>> SwitchType = switch/none
>> TaskPlugin = task/cgroup
>> Waittime = 0
>> 
>> 
>> # Nodes
>> NodeName=cas[001-100] CPUs=40 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=190000 Weight=60
>> 
>> 
>> # Partitions
>> PartitionName=small Nodes=cas[001-100] MaxCPUsPerNode=39 MaxNodes=1 MaxTime=7-00:00:00 DefMemPerCPU=4700 MaxMemPerCPU=4700 State=UP AllowQos=ALL
>> 
>> 
>> 
>> 
> _______________________________________________
> Support mailing list
> supp...@lists.hpc.sjtu.edu.cn
> http://lists.hpc.sjtu.edu.cn/mailman/listinfo/support