Config details:

- Slurm v17.11.8
- QOS-based preemption
- Backfill scheduler (default parameters)
- QOS:
  - "normal" = PreemptMode=CANCEL, GraceTime=5 minutes
  - Per-stakeholder = Preempt=normal GrpTRES=<limits>
- Partitions:
  - "standard" (default) = QOS=normal
  - Per-stakeholder = QOS=<stakeholder-qos>

When users need priority access to purchased hardware, they submit to a 
stakeholder partition; jobs in stakeholder partitions can preempt jobs in the 
opportunistic "standard" partition.
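
For concreteness, here is a minimal sketch of how that setup is expressed in
slurm.conf and sacctmgr.  The node lists and GrpTRES values below are
placeholders rather than our actual limits, and "ccei_biomass" stands in for
one of the stakeholder partition/QOS pairs:

  # slurm.conf (excerpt)
  PreemptType=preempt/qos
  PreemptMode=CANCEL
  PartitionName=standard     Default=YES  QOS=normal        Nodes=r00n[00-47]
  PartitionName=ccei_biomass Default=NO   QOS=ccei_biomass  Nodes=r00n[00-47]

  # QOS setup via sacctmgr (GraceTime is given in seconds)
  sacctmgr modify qos normal set PreemptMode=Cancel GraceTime=300
  sacctmgr add qos ccei_biomass
  sacctmgr modify qos ccei_biomass set Preempt=normal GrpTRES=cpu=720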


Problem 1:  Preemption not logged
=================================

We've had a number of users report jobs that fail before reaching their 
TimeLimit.  There is no mention of preemption in the slurmd/slurmctld logs, but 
the EndTime has been altered from its original value and the job step(s) end up 
with FAILED status:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
237755       ewb0.55_b+   standard XXXXXXXXXX         72     FAILED     77:0 
237755.batch      batch            XXXXXXXXXX         36     FAILED     77:0 
237755.exte+     extern            XXXXXXXXXX         72  COMPLETED      0:0 
237755.0          orted            XXXXXXXXXX          1     FAILED      1:0 

The slurm-237755.out file shows the MPI runtime lost contact with the remote 
orted daemon (step 237755.0) and failed.  I think I've traced this through the 
Slurm source far enough to determine that, with a non-zero GraceTime, the 
EndTime is altered and the SIGCONT-SIGTERM pair is sent directly by slurmctld 
(no SIG_PREEMPTED; see slurm_job_check_grace() and _preempt_signal() in 
slurmctld/preempt.c); slurmctld logs nothing about the preemption at this point 
outside of debug levels.  The signal is not caught by orted, so step 237755.0 
dies and slurmd logs:

[237755.0] error: *** STEP 237755.0 ON r00n23 CANCELLED AT 2019-01-15T11:20:42 ***

because the signal in question was SIGTERM (see slurmd/slurmstepd/req.c).  This 
causes the MPI runtime to exit, and the batch script returns in error.  Since 
GraceTime hasn't expired, slurmctld logs a failure instead of a preemption.  The 
only indicators of preemption are the altered EndTime (relative to its starting 
value) and a job from a per-stakeholder partition (which can preempt 
"standard") starting on the nodes _immediately_ following the death of 237755:

[2019-01-15T05:16:12.894] _slurm_rpc_submit_batch_job: JobId=237755 
InitPrio=2275 usec=5960
[2019-01-15T09:11:19.367] sched: Allocate JobID=237755 NodeList=r00n[21,23] 
#CPUs=72 Partition=standard
[2019-01-15T11:20:41.430] _slurm_rpc_submit_batch_job: JobId=238396 
InitPrio=22402 usec=6505
[2019-01-15T11:20:43.594] _job_complete: JobID=237755 State=0x1 NodeCnt=2 
WEXITSTATUS 205
[2019-01-15T11:20:43.594] email msg to XXXXXXXXXXXXXX: SLURM Job_id=237755 
Name=ewb0.55_b96sc00_a Failed, Run time 02:09:24, FAILED, ExitCode 205
[2019-01-15T11:20:43.595] _job_complete: JobID=237755 State=0x8005 NodeCnt=2 
done
[2019-01-15T11:20:44.606] email msg to XXXXXXXXXXXXXX: SLURM Job_id=238396 
Name=Pt111_sol Began, Queued time 00:00:03
[2019-01-15T11:20:44.606] sched: Allocate JobID=238396 NodeList=r00n[21,23] 
#CPUs=72 Partition=ccei_biomass
[2019-01-15T14:13:55.232] _job_complete: JobID=238396 State=0x1 NodeCnt=2 
WEXITSTATUS 0
[2019-01-15T14:13:55.237] _job_complete: JobID=238396 State=0x8003 NodeCnt=2 
done
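
If you want to check your own accounting records for the same pattern, a sacct 
query along these lines is enough to show the symptom, i.e. a FAILED state with 
an End time far short of the requested Timelimit (though by itself it can't 
distinguish a silent preemption from any other early failure):

  sacct -j 237755 --format=JobID,State,ExitCode,Submit,Start,End,Timelimit,Elapsed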

In contrast, jobs that catch or ignore SIGCONT-SIGTERM and keep running through 
the 5-minute GraceTime are then killed, and their overall state is logged as 
preempted:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
412845        TTC100um1   standard XXXXXXXXXX         36  PREEMPTED      0:0 
412845.batch      batch            XXXXXXXXXX         36  CANCELLED     0:15 
412845.exte+     extern            XXXXXXXXXX         36  COMPLETED      0:0 

So any job that catches SIGCONT-SIGTERM, gracefully ends its work, and exits 
before GraceTime expires would be logged as COMPLETED?  Is this behavior in line 
with anyone else's experience?
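
For concreteness, the kind of batch script I have in mind is sketched below.  
The application name, time limit, and checkpoint step are placeholders, and 
note that (per Problem 1) the job steps receive SIGCONT-SIGTERM directly as 
well, so the application itself would also have to survive or handle SIGTERM:

  #!/bin/bash
  #SBATCH --partition=standard
  #SBATCH --time=1-00:00:00

  # On the preemption SIGTERM:  checkpoint/clean up, stop the step, and exit 0
  # well inside the 5-minute GraceTime.  (save_checkpoint is a placeholder.)
  finish() {
      ./save_checkpoint
      kill -TERM "$child" 2>/dev/null
      wait "$child"
      exit 0
  }
  trap finish TERM

  # Run the real work in the background so bash can service the trap; wait
  # returns early when the signal arrives, at which point finish() runs.
  srun ./my_application &
  child=$!
  wait "$child"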



Problem 2:  Backfill of preempted node(s)
=========================================

This one was especially confounding:

[2019-03-07T23:08:43.769] _slurm_rpc_submit_batch_job: JobId=410289 
InitPrio=2997 usec=5694
[2019-03-07T23:08:46.215] udhpc: setting job account to safrono (1040)
[2019-03-07T23:08:46.215] udhpc: time_min is empty, setting to time_limit
   :
[2019-03-08T00:28:04.364] backfill: Started JobID=410289 in standard on r00n29
[2019-03-08T00:32:43.490] job_time_limit: Preemption GraceTime reached 
JobId=409950
[2019-03-08T00:32:43.501] job_time_limit: Preemption GraceTime reached 
JobId=409951
[2019-03-08T00:32:43.503] job_time_limit: Preemption GraceTime reached 
JobId=409952
[2019-03-08T00:32:43.505] job_time_limit: Preemption GraceTime reached 
JobId=409953
[2019-03-08T00:32:43.507] job_time_limit: Preemption GraceTime reached 
JobId=409954
[2019-03-08T00:32:43.509] job_time_limit: Preemption GraceTime reached 
JobId=409955
[2019-03-08T00:32:43.511] job_time_limit: Preemption GraceTime reached 
JobId=409956
[2019-03-08T00:32:43.513] job_time_limit: Preemption GraceTime reached 
JobId=409957
[2019-03-08T00:32:43.515] job_time_limit: Preemption GraceTime reached 
JobId=409958
[2019-03-08T00:32:43.517] job_time_limit: Preemption GraceTime reached 
JobId=409959
[2019-03-08T00:32:43.519] job_time_limit: Preemption GraceTime reached 
JobId=409960
[2019-03-08T00:32:43.520] job_time_limit: Preemption GraceTime reached 
JobId=409961
[2019-03-08T00:32:43.522] job_time_limit: Preemption GraceTime reached 
JobId=409962
[2019-03-08T00:32:43.524] job_time_limit: Preemption GraceTime reached 
JobId=409963
[2019-03-08T00:32:43.526] job_time_limit: Preemption GraceTime reached 
JobId=409964
[2019-03-08T00:32:43.527] job_time_limit: Preemption GraceTime reached 
JobId=409965
[2019-03-08T00:32:43.529] job_time_limit: Preemption GraceTime reached 
JobId=409966
[2019-03-08T00:32:43.531] job_time_limit: Preemption GraceTime reached 
JobId=409967
[2019-03-08T00:32:43.532] job_time_limit: Preemption GraceTime reached 
JobId=409968
[2019-03-08T00:32:43.534] job_time_limit: Preemption GraceTime reached 
JobId=409969
[2019-03-08T00:32:43.535] job_time_limit: Preemption GraceTime reached 
JobId=409970
[2019-03-08T00:32:43.537] job_time_limit: Preemption GraceTime reached 
JobId=409971
[2019-03-08T00:32:43.538] job_time_limit: Preemption GraceTime reached 
JobId=410003
[2019-03-08T00:32:46.226] _job_complete: JobID=410289 State=0x1 NodeCnt=1 
WEXITSTATUS 1
[2019-03-08T00:32:46.230] _job_complete: JobID=410289 State=0x8005 NodeCnt=1 
done
[2019-03-08T00:32:46.310] email msg to lina...@udel.edu: SLURM Job_id=410300 
Name=mesh_12x12x4_T3 Began, Queued time 00:05:17
[2019-03-08T00:32:46.350] sched: Allocate JobID=410300 NodeList=r00n29 #CPUs=36 
Partition=ccei_biomass

Job 410300 was in a per-stakeholder partition, thus preempting jobs in the 
"standard" partition.  With a GraceTime of 5 minutes, the preemption must have 
occurred ca. 00:27:43.  So how did the backfill scheduler manage to start a new 
job (410289) on that node at 00:28:04, _after_ the preemption state should have 
been noted on r00n29?  The submitted TimeLimit on 410289 was 7 days, yet again 
its EndTime was altered to the exact time it died, and it died the same way as 
in Problem 1:  the initial SIGCONT-SIGTERM delivery killed step 0, etc.  With 
the EndTime reset to 00:32:46, the preemption would have had to happen ca. 
00:27:46.  Seems like a race condition.
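
If it helps anyone look for the same pattern on their own cluster, correlating 
the backfill starts with the surrounding preemption and allocation messages in 
the slurmctld log is enough to expose the ordering; something along these lines 
(adjust the node name, and the log path to wherever SlurmctldLogFile points):

  grep -E 'backfill: Started .*r00n29|Preemption GraceTime reached|sched: Allocate .*r00n29' \
      /var/log/slurmctld.log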


The release notes don't seem to indicate that these issues were known and have 
been addressed, and I didn't find anything on bugs.schedmd.com that seemed to 
pertain.  To anyone who has encountered similar behavior:  how did you mitigate 
it?




::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::



