Hi all,

According to the Slurm documentation, the SIGCONT and SIGTERM signals are sent 
twice to a job that is selected for preemption:

“Once a job has been selected for preemption, its end time is set to the 
current time plus GraceTime. The job is immediately sent SIGCONT and SIGTERM 
signals in order to provide notification of its imminent termination. This is 
followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its 
new end time.”

I can trap the first SIGTERM in a job submitted with srun, and in a job step 
launched with srun from inside a batch script submitted with sbatch. However, 
I cannot trap it in the batch script itself: the batch shell only receives a 
SIGTERM after GraceTime has expired. Why is the first SIGTERM not sent to the 
batch shell? I use the following test job:

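#!/bin/bash

# Signal handler: log, simulate 10 s of cleanup work, then exit.
housekeeping() {
        echo "$(date): Cleaning up..." >> job.log
        sleep 10
        echo "$(date): Done." >> job.log
        exit 1
}

trap 'housekeeping' TERM

echo "$(date): Starting batch job." >> job.log

# Sleep in the background and wait on it, so that the trap runs as soon as
# the signal arrives instead of after the current foreground command ends.
while true; do
        sleep 2 &
        wait $!
done

exit 0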

Example: Submitting the test job with sbatch:

SubmitTime=2018-10-18T15:01:52 EligibleTime=2018-10-18T15:01:52
StartTime=2018-10-18T15:01:54 EndTime=2018-10-18T15:03:13 Deadline=N/A
PreemptTime=2018-10-18T15:02:13 SuspendTime=None SecsPreSuspend=0

job.log:
Thu Oct 18 15:01:54 CEST 2018: Starting batch job.
Thu Oct 18 15:03:24 CEST 2018: Cleaning up...
Thu Oct 18 15:03:34 CEST 2018: Done.

Example: Submitting the test job with srun:

SubmitTime=2018-10-18T15:08:52 EligibleTime=2018-10-18T15:08:52
StartTime=2018-10-18T15:08:52 EndTime=2018-10-18T15:09:50 Deadline=N/A
PreemptTime=2018-10-18T15:09:40 SuspendTime=None SecsPreSuspend=0

job.log:
Thu Oct 18 15:08:52 CEST 2018: Starting batch job.
Thu Oct 18 15:09:40 CEST 2018: Cleaning up...
Thu Oct 18 15:09:50 CEST 2018: Done.
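
To make the difference explicit, the delay between PreemptTime and the first 
handler message can be worked out from the timestamps above. A small sketch 
(the helper names to_secs and delay are mine, for illustration only, and it 
assumes both timestamps fall on the same day):

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds since midnight.
# The 10# prefix forces base 10 so that fields like "09" are not read as octal.
to_secs() {
        local h m s
        IFS=: read -r h m s <<< "$1"
        echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Seconds elapsed from the first timestamp to the second (same day assumed).
delay() {
        echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

# PreemptTime vs. the "Cleaning up..." line, taken from the two runs above.
echo "sbatch: handler ran $(delay 15:02:13 15:03:24) s after PreemptTime"
echo "srun:   handler ran $(delay 15:09:40 15:09:40) s after PreemptTime"
```

With sbatch the handler starts 71 s after PreemptTime, i.e. only after the 
60 s GraceTime has passed (EndTime 15:03:13) plus roughly 11 s. With srun it 
starts in the same second as PreemptTime.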

Slurm version 17.02.10

slurm.conf:
(…)
PreemptType=preempt/qos
PreemptMode=CANCEL
PartitionName=low-prio Nodes=node[01-09] DefaultTime=01:00:00 MaxTime=24:00:00 
DefMemPerCPU=2020 GraceTime=60 State=UP QOS=part_gpu
(…)

Is this the intended behavior, or am I missing something? It seems that the 
only ways to perform housekeeping from inside a batch script are to use the 
--signal option (e.g. --signal=B:TERM@60) or to rely on the extra time 
provided by KillWait. Can anybody confirm?
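
For reference, here is a minimal sketch of the --signal variant I have in 
mind. It writes the job script to a temporary directory and then exercises 
the trap locally by sending SIGTERM by hand, so it runs without Slurm. The 
temporary paths, the log file argument, and the shortened handler (no sleep) 
are assumptions for illustration. Whether Slurm 17.02 actually delivers this 
signal to the batch shell on preemption is what I would like to have 
confirmed.

```shell
#!/bin/bash
# Sketch: create a job script that requests an early SIGTERM for the batch
# shell, then test its trap locally by signalling it by hand (no Slurm).

workdir=$(mktemp -d)
log="$workdir/job.log"

cat > "$workdir/job.sh" <<'EOF'
#!/bin/bash
# Ask Slurm to send TERM to the batch shell only (B:), 60 s before end time.
#SBATCH --signal=B:TERM@60

log="$1"   # hypothetical: log path passed as first argument

housekeeping() {
        echo "Cleaning up..." >> "$log"
        echo "Done." >> "$log"
        exit 1
}

trap 'housekeeping' TERM

echo "Starting batch job." >> "$log"

# Background the payload and wait on it so the trap can run promptly.
while true; do
        sleep 2 &
        wait $!
done
EOF

# Run the job script locally in the background; bash ignores #SBATCH lines.
bash "$workdir/job.sh" "$log" &
job_pid=$!

# The trap is installed before the first log line is written, so a
# non-empty log means the script is ready to receive the signal.
for (( i = 0; i < 50; i++ )); do
        [ -s "$log" ] && break
        sleep 0.1
done

# Stand in for Slurm: deliver SIGTERM to the batch shell only.
kill -TERM "$job_pid"
wait "$job_pid"
status=$?

cat "$log"
echo "exit status: $status"
```

Under Slurm the same job.sh would be submitted with sbatch, and the 
--signal directive would take the place of the manual kill.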

Thank you!

---
Universität Bern
Informatikdienste
Gruppe Systemdienste

Nico Färber
Systemadministrator HPC

Hochschulstrasse 6
CH-3012 Bern
Tel. +41 (0)31 631 51 89

mailto: grid-supp...@id.unibe.ch
http://www.id.unibe.ch/

