Hi All,

I want to preempt a job in a low-priority partition and have it restarted when resources become available again, but the restarted job sometimes fails immediately. Are there any settings or configuration changes required for job preemption to work reliably?
I used slurm-docker-cluster to build a Slurm cluster for testing. The same problem also occurred on other clusters deployed with NVIDIA/DeepOps. The relevant configuration is as follows:

```
SLURM_VERSION        = 19.05.1-2
SchedulerType        = sched/backfill
SelectType           = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
PreemptMode          = REQUEUE
PreemptType          = preempt/partition_prio
```

Low-priority partition info:

```
PartitionName=common.q
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[1-2]
   PriorityJobFactor=1 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
```

High-priority partition info:

```
PartitionName=sepang.q
   AllowGroups=ALL AllowAccounts=sepang AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=sepang
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=c[1-2]
   PriorityJobFactor=1 PriorityTier=20 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED
```

I executed the following script. The first job was preempted by the third job and requeued. The requeued job restarted, but sometimes failed immediately.

```
#!/usr/bin/env bash
set -ex

time=130

sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p common.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 5
sbatch -p sepang.q -n 1 --wrap="srun -n 1 sleep $time"
sleep 1
```

It seems that jobs restarted by the main (event-driven) scheduling loop fail immediately, while jobs restarted by the backfill scheduler complete successfully. What is the difference between "sched" and "backfill" in this respect? In the output below, JobId 117 and JobId 120 were both preempted and requeued: JobId 117 was restarted by backfill and completed, but JobId 120 was restarted by sched and failed. If the script sleeps for 120 seconds, backfill tends to restart the requeued job; with 130 seconds, sched restarts it. JobIds 117-119 are the former case and JobIds 120-122 the latter.
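For reference, this is roughly how I check which scheduler path restarted a requeued job: grep slurmctld.log for the "sched: Allocate" / "backfill: Started" messages and cross-check with sacct. This is only a sketch for my test cluster; the log path and helper script name are assumptions, so adjust them for your environment.

```bash
#!/usr/bin/env bash
# Sketch: report whether a job was (re)started by the main scheduler ("sched")
# or by the backfill thread ("backfill"). SLURMCTLD_LOG defaults to a path
# assumed from my test setup; change it to match your cluster.
set -eu

jobid=${1:?usage: $0 <jobid>}
ctld_log=${SLURMCTLD_LOG:-/var/log/slurm/slurmctld.log}

# The main scheduling loop logs "sched: Allocate JobId=...",
# the backfill scheduler logs "backfill: Started JobId=...".
grep -E "sched: Allocate JobId=${jobid} |backfill: Started JobId=${jobid} " "$ctld_log" || true

# Requeue count, if the job record is still held by slurmctld.
scontrol show job "$jobid" 2>/dev/null | grep -o 'Restarts=[0-9]*' || true

# Final accounting record.
sacct -j "$jobid" -X -o jobid,state,exitcode,start,end,elapsed
```

For JobId=120 this prints the "sched: Allocate" line and for JobId=117 the "backfill: Started" line, matching the FAILED/COMPLETED split in the logs below.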
```
# sacct -X -o jobid,jobname,partition,alloccpus,state,exitcode
       JobID    JobName  Partition  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- --------
118                wrap   common.q          1  COMPLETED      0:0
117                wrap   common.q          1  COMPLETED      0:0
119                wrap   sepang.q          1  COMPLETED      0:0
121                wrap   common.q          1  COMPLETED      0:0
122                wrap   sepang.q          1  COMPLETED      0:0
120                wrap   common.q          1     FAILED      1:0
```

slurmctld.log:

```
# FAILED case
[2019-11-26T10:07:35.322] sched: Allocate JobId=120 NodeList=c2 #CPUs=1 Partition=common.q
[2019-11-26T10:07:35.322] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T10:07:35.323] debug2: Spawning RPC agent for msg_type REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T10:07:35.323] debug2: Tree head got back 0 looking for 1
[2019-11-26T10:07:35.324] debug2: Tree head got back 1
[2019-11-26T10:07:35.328] debug2: node_did_resp c2
[2019-11-26T10:07:35.374] debug2: Processing RPC: REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T10:07:35.374] debug: _slurm_rpc_job_pack_alloc_info: JobId=120 NodeList=c2 usec=81
[2019-11-26T10:07:35.385] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=120
[2019-11-26T10:07:35.386] _job_complete: JobId=120 WEXITSTATUS 1

# COMPLETED case
[2019-11-26T09:57:39.243] backfill: Started JobId=117 in common.q on c1
[2019-11-26T09:57:39.243] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.243] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 QOS normal got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 12(sepang/tanaka/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 5(sepang/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: altering JobId=117 assoc 1(root/(null)/(null)) got 31536000 just removed 257698037700 and added 31536000
[2019-11-26T09:57:39.244] debug2: Spawning RPC agent for msg_type REQUEST_BATCH_JOB_LAUNCH
[2019-11-26T09:57:39.245] debug2: Tree head got back 0 looking for 1
[2019-11-26T09:57:39.246] debug2: Tree head got back 1
[2019-11-26T09:57:39.250] debug2: node_did_resp c1
[2019-11-26T09:57:39.277] debug2: Processing RPC: REQUEST_JOB_PACK_ALLOC_INFO from uid=1016
[2019-11-26T09:57:39.277] debug: _slurm_rpc_job_pack_alloc_info: JobId=117 NodeList=c1 usec=78
[2019-11-26T09:57:39.278] debug: laying out the 1 tasks on 1 hosts c1 dist 2
[2019-11-26T09:57:39.279] debug2: _group_cache_lookup_internal: found valid entry for tanaka
[2019-11-26T09:58:02.533] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:09.244] debug: backfill: beginning
[2019-11-26T09:58:09.244] debug: backfill: no jobs to backfill
[2019-11-26T09:58:25.559] debug2: Performing purge of old job records
[2019-11-26T09:58:25.559] debug2: purge_old_job: purged 1 old job records
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: starting, 1 jobs to purge
[2019-11-26T09:58:25.559] debug2: _purge_files_thread: purging files from JobId=115
[2019-11-26T09:58:25.559] debug: sched: Running job scheduler
[2019-11-26T09:58:32.567] debug2: Testing job time limits and checkpoints
[2019-11-26T09:58:39.245] debug: backfill: beginning
[2019-11-26T09:58:39.245] debug: backfill: no jobs to backfill
[2019-11-26T09:59:02.601] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:25.627] debug2: Performing purge of old job records
[2019-11-26T09:59:25.627] debug: sched: Running job scheduler
[2019-11-26T09:59:32.635] debug2: Testing job time limits and checkpoints
[2019-11-26T09:59:39.324] debug2: full switch release for JobId=117 StepId=1, nodes c1
[2019-11-26T09:59:39.342] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=117
[2019-11-26T09:59:39.342] _job_complete: JobId=117 WEXITSTATUS 0
[2019-11-26T09:59:39.342] _job_complete: JobId=117 done
```

slurmd.log:

```
# FAILED case
[2019-11-26T10:07:35.341] [120.batch] task 0 (1108) started 2019-11-26T10:07:35
[2019-11-26T10:07:35.342] [120.batch] debug: task_p_pre_launch_priv: 120.4294967294
[2019-11-26T10:07:35.342] [120.batch] debug2: adding task 0 pid 1108 on node 0 to jobacct
[2019-11-26T10:07:35.344] [120.batch] debug2: _get_precs: energy = 0 watts = 0
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory
[2019-11-26T10:07:35.356] [120.batch] debug2: xcgroup_load: unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory
[2019-11-26T10:07:35.358] [120.batch] debug: task_p_pre_launch: 120.4294967294, task 0
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: RLIMIT_NPROC : max:inf cur:inf req:4096
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE no change in value: 1048576
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 65536
[2019-11-26T10:07:35.358] [120.batch] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2019-11-26T10:07:35.379] [120.batch] debug2: _get_precs: energy = 0 watts = 0
[2019-11-26T10:07:35.379] [120.batch] debug2: removing task 0 pid 1108 from jobacct
[2019-11-26T10:07:35.380] [120.batch] task 0 (1108) exited with exit code 1.
```

```
# sacct -o jobid,jobname,partition,alloccpus,state,exitcode,submit,start,end,elapsed -j 117,118,119,120,121,122
       JobID    JobName  Partition  AllocCPUS      State ExitCode              Submit               Start                 End    Elapsed
------------ ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ------------------- ----------
118                wrap   common.q          1  COMPLETED      0:0 2019-11-26T09:55:29 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
118.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
118.0             sleep                     1  COMPLETED      0:0 2019-11-26T09:55:30 2019-11-26T09:55:30 2019-11-26T09:57:30   00:02:00
117                wrap   common.q          1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
117.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
117.1             sleep                     1  COMPLETED      0:0 2019-11-26T09:57:39 2019-11-26T09:57:39 2019-11-26T09:59:39   00:02:00
119                wrap   sepang.q          1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
119.batch         batch                     1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
119.0             sleep                     1  COMPLETED      0:0 2019-11-26T09:55:34 2019-11-26T09:55:34 2019-11-26T09:57:34   00:02:00
121                wrap   common.q          1  COMPLETED      0:0 2019-11-26T10:05:24 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
121.batch         batch                     1  COMPLETED      0:0 2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
121.0             sleep                     1  COMPLETED      0:0 2019-11-26T10:05:25 2019-11-26T10:05:25 2019-11-26T10:07:35   00:02:10
122                wrap   sepang.q          1  COMPLETED      0:0 2019-11-26T10:05:29 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
122.batch         batch                     1  COMPLETED      0:0 2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
122.0             sleep                     1  COMPLETED      0:0 2019-11-26T10:05:30 2019-11-26T10:05:30 2019-11-26T10:07:40   00:02:10
120                wrap   common.q          1     FAILED      1:0 2019-11-26T10:05:30 2019-11-26T10:07:35 2019-11-26T10:07:35   00:00:00
120.batch         batch                     1     FAILED      1:0 2019-11-26T10:07:35 2019-11-26T10:07:35 2019-11-26T10:07:35   00:00:00
```

Thanks,
Koso Kashima