Hi All,

I want to preempt a job in a low-priority partition and have it restarted when resources become available again, but the restarted job sometimes fails immediately. Are there any settings or configuration changes required for job preemption to work reliably?
I used slurm-docker-cluster to build a Slurm cluster for testing. The same problem also occurred on other clusters deployed with NVIDIA/DeepOps. The relevant configuration is as follows:

```
SLURM_VERSION        = 19.05.1-2
SchedulerType        = sched/backfill
SelectType           = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
PreemptMode          = REQUEUE
PreemptType          = preempt/partition_prio
```

Low-priority partition info:

```
PartitionName=common.q
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[1-2]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
```

High-priority partition info:

```
PartitionName=sepang.q
   AllowGroups=ALL AllowAccounts=sepang AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=sepang
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[1-2]
   PriorityJobFactor=1 PriorityTier=20 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
```

I executed the following script. The first job was preempted by the third job and requeued. The requeued job restarted, but sometimes failed immediately.

```
#!/usr/bin/env bash
set -ex

time=130

sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p sepang.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 1
```

It seems that jobs restarted by the main (event-driven) scheduling loop fail immediately, while jobs restarted by the backfill scheduler complete successfully. What is the difference between "sched" and "backfill" in this respect? In the output below, JobId 117 and JobId 120 were both preempted and requeued: JobId 117 was restarted by backfill and completed, but JobId 120 was restarted by sched and failed. If the script sleeps for 120 seconds, backfill tends to restart the requeued job; with 130 seconds, sched restarts it. JobIds 117-119 are the former case and JobIds 120-122 the latter.
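For reference, this is roughly how I check which scheduler path restarted a requeued job: grep slurmctld.log for the "sched: Allocate" / "backfill: Started" messages and cross-check with sacct. This is only a sketch for my test cluster; the log path and helper script name are assumptions, so adjust them for your environment.

```bash
#!/usr/bin/env bash
# Sketch: report whether a job was (re)started by the main scheduler ("sched")
# or by the backfill thread ("backfill"). SLURMCTLD_LOG defaults to a path
# assumed from my test setup; change it to match your cluster.
set -eu

jobid=${1:?usage: $0 <jobid>}
ctld_log=${SLURMCTLD_LOG:-/var/log/slurm/slurmctld.log}

# The main scheduling loop logs "sched: Allocate JobId=...",
# the backfill scheduler logs "backfill: Started JobId=...".
grep -E "sched: Allocate JobId=${jobid} |backfill: Started JobId=${jobid} " "$ctld_log" || true

# Requeue count, if the job record is still held by slurmctld.
scontrol show job "$jobid" 2>/dev/null | grep -o 'Restarts=[0-9]*' || true

# Final accounting record.
sacct -j "$jobid" -X -o jobid,state,exitcode,start,end,elapsed
```

For JobId=120 this prints the "sched: Allocate" line and for JobId=117 the "backfill: Started" line, matching the FAILED/COMPLETED split in the logs below.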
```
# sacct -X -o jobid,jobname,partition,alloccpus,state,exitcode
       JobID    JobName  Partition  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- --------
118                wrap   common.q          1  COMPLETED      0:0
117                wrap   common.q          1  COMPLETED      0:0
119                wrap   sepang.q          1  COMPLETED      0:0
121                wrap   common.q          1  COMPLETED      0:0
122                wrap   sepang.q          1  COMPLETED      0:0
120                wrap   common.q          1     FAILED      1:0
```

slurmctld.log:

```
# FAILED case
[2019-11-26T10:07:35.322] sched: Allocate JobId=120 NodeList=c2 #CPUs=1 Partition=common.q
[2019-11-26T10:07:35.322] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T10:07:35.323] debug2: Spawning RPC agent for msg_type REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T10:07:35.323] debug2: Tree head got back 0 looking for 1
[2019-11-26T10:07:35.324] debug2: Tree head got back 1
[2019-11-26T10:07:35.328] debug2: node_did_resp c2
[2019-11-26T10:07:35.374] debug2: Processing RPC: REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T10:07:35.374] debug: _slurm_rpc_job_pack_alloc_info: JobId=120 NodeList=c2 usec=81
[2019-11-26T10:07:35.385] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=120
[2019-11-26T10:07:35.386] _job_complete: JobId=120 WEXITSTATUS 1

# COMPLETED case
[2019-11-26T09:57:39.243] backfill: Started JobId=117 in common.q on c1
[2019-11-26T09:57:39.243] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: Spawning RPC agent for msg_type REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T09:57:39.245] debug2: Tree head got back 0 looking for 1
[2019-11-26T09:57:39.246] debug2: Tree head got back 1
[2019-11-26T09:57:39.250] debug2: node_did_resp c1
[2019-11-26T09:57:39.277] debug2: Processing RPC: REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T09:57:39.277] debug: _slurm_rpc_job_pack_alloc_info: JobId=117 NodeList=c1 usec=78
[2019-11-26T09:57:39.278] debug: laying out the 1 tasks on 1 hosts c1 dist 2
[2019-11-26T09:57:39.279] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T09:58:02.533] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:09.244] debug: backfill: beginning
[2019-11-26T09:58:09.244] debug: backfill: no jobs to backfill
[2019-11-26T09:58:25.559] debug2: Performing purge of old job records
[2019-11-26T09:58:25.559] debug2: purge_old_job: purged 1 old job records
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: starting, 1 jobs to purge
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: purging files from JobId=115
[2019-11-26T09:58:25.559] debug: sched: Running job scheduler
[2019-11-26T09:58:32.567] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:39.245] debug: backfill: beginning
[2019-11-26T09:58:39.245] debug: backfill: no jobs to backfill
[2019-11-26T09:59:02.601] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:25.627] debug2: Performing purge of old job records
[2019-11-26T09:59:25.627] debug: sched: Running job scheduler
[2019-11-26T09:59:32.635] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:39.324] debug2: full switch release for JobId=117 StepId=1, nodes c1
[2019-11-26T09:59:39.342] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=117
[2019-11-26T09:59:39.342] _job_complete: JobId=117 WEXITSTATUS 0
[2019-11-26T09:59:39.342] _job_complete: JobId=117 done
```

slurmd.log:

```
# FAILED case
[2019-11-26T10:07:35.341] [120.batch] task 0 (1108) started 2019-11-26T10:07:35
[2019-11-26T10:07:35.342] [120.batch] debug: task_p_pre_launch_priv: 120.4294967294
[2019-11-26T10:07:35.342] [120.batch] debug2: adding task 0 pid 1108 on node 0 to jobacct
[2019-11-26T10:07:35.344] [120.batch] debug2: _get_precs: energy = 0 watts = 0
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
[2019-11-26T10:07:35.358] [120.batch] debug: task_p_pre_launch: 120.4294967294, task 0
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: RLIMIT_NPROC : max:inf cur:inf req:4096
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 1048576
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 65536
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2019-11-26T10:07:35.379] [120.batch] debug2: _get_precs: energy = 0 watts = 0
[2019-11-26T10:07:35.379] [120.batch] debug2: removing task 0 pid 1108 from jobacct
[2019-11-26T10:07:35.380] [120.batch] task 0 (1108) exited with exit code 1.
```

```
# sacct -o jobid,jobname,partition,alloccpus,state,exitcode,submit,start,end,elapsed -j 117,118,119,120,121,122
       JobID    JobName  Partition  AllocCPUS      State ExitCode              Submit               Start                 End    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ------------------- ----------
118                wrap   common.q          1  COMPLETED      0:0 2019-11-26T09:55:29 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
118.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
118.0             sleep                     1  COMPLETED      0:0 2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
117                wrap   common.q          1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
117.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
117.1             sleep                     1  COMPLETED      0:0 2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
119                wrap   sepang.q          1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
119.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
119.0             sleep                     1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
121                wrap   common.q          1  COMPLETED      0:0 2019-11-26T10:05:24 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
121.batch         batch                     1  COMPLETED      0:0 2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
121.0             sleep                     1  COMPLETED      0:0 2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
122                wrap   sepang.q          1  COMPLETED      0:0 2019-11-26T10:05:29 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
122.batch         batch                     1  COMPLETED      0:0 2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
122.0             sleep                     1  COMPLETED      0:0 2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
120                wrap   common.q          1     FAILED      1:0 2019-11-26T10:05:30 2019-11-26T10:07:35 2019-11-26T10:07:35   00:00:00
120.batch         batch                     1     FAILED      1:0 2019-11-26T10:07:35 2019-11-26T10:07:35 2019-11-26T10:07:35   00:00:00
```

Thanks,
Koso Kashima