Hi, we currently have an issue where a bunch (56K) of runaway jobs cannot be cleared:
  sacctmgr show runaway | wc -l
  sacctmgr: error: slurmdbd: Sending message type 1488: 11: No error
  sacctmgr: error: Failed to fix runaway job: Resource temporarily unavailable
  58588

Has anyone run into this? We've tried restarting slurmdbd, slurmctld, mysql, etc., but it does not help.

The slurmdbd log shows the following when the "sacctmgr show runawayjobs" command is run:

  [2018-04-24T07:56:03.869] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)
  [2018-04-24T07:56:03.872] error: Invalid msg_size (31302621) from connection 7(172.16.2.1) uid(3510)
  [2018-04-24T07:56:03.874] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)
  [2018-04-24T07:56:03.875] error: Invalid msg_size (31302621) from connection 7(172.16.2.1) uid(3510)
  [2018-04-24T07:56:03.877] error: Invalid msg_size (31302621) from connection 12(172.16.2.1) uid(3510)

This seems to indicate that there are too many runaway jobs to clear at once: 31302621 bytes is about 30 MB, which presumably exceeds slurmdbd's maximum accepted message size, since the fix request apparently carries all ~58K job records in a single message. I wonder if there is a way to select a smaller number of jobs for removal, but I don't see such an option.

This all started last week when Slurm crashed after being seriously hammered by a user submitting 500K two-minute jobs. Slurmdbd appeared unable to handle all the transactions that slurmctld was sending it:

  ...
  [2018-04-15T20:16:35.021] slurmdbd: agent queue size 100
  [2018-04-16T11:54:29.312] slurmdbd: agent queue size 200
  [2018-04-18T17:53:22.339] slurmdbd: agent queue size 19100
  [2018-04-18T17:59:58.413] slurmdbd: agent queue size 64100
  [2018-04-18T18:06:10.143] slurmdbd: agent queue size 104300
  ...
  [2018-04-18T18:20:37.597] error: slurmdbd: agent queue filling (200214), RESTART SLURMDBD NOW
  ...
  error: slurmdbd: Sending fini msg: No error
  ...

So now lots of our nodes are idle, but Slurm is not starting jobs:

  [cbc@siris ~ ]$ sreport cluster utilization
  --------------------------------------------------------------------------------
  Cluster Utilization 2018-04-23T00:00:00 - 2018-04-23T23:59:59
  Usage reported in CPU Minutes
  --------------------------------------------------------------------------------
    Cluster Allocated     Down PLND Dow     Idle Reserved  Reported
  --------- --------- -------- -------- -------- -------- ---------
    monsoon   4216320        0        0        0        0   4216320

sreport shows the entire cluster as fully utilized, yet this is not the case.

I see that there is a fix for runaway jobs in version 17.11.5:

  -- sacctmgr - fix runaway jobs identification.

We upgraded to 17.11.5 this morning, but we still cannot clear the runaway jobs. I wonder if we'll need to manually remove them with some mysql foo; we are investigating that now. Hope maybe someone has run into this.

Thanks,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
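
P.S. For the record, the "mysql foo" we are looking at is along these lines. This is only a sketch, assuming the default slurm_acct_db schema, where each cluster has a <cluster>_job_table and runaway jobs are (roughly) the rows with no end time; "monsoon_job_table" matches our cluster name. Definitely mysqldump the database first and stop slurmdbd before changing anything:

  -- roughly what sacctmgr should count as runaway: jobs that started but never ended
  SELECT count(*) FROM monsoon_job_table
  WHERE time_end = 0 AND time_start != 0;

  -- close them out; state 3 is JOB_COMPLETE in Slurm's job state enum
  UPDATE monsoon_job_table
  SET state = 3, time_end = time_start
  WHERE time_end = 0 AND time_start != 0;

As I understand it, when sacctmgr fixes runaway jobs, slurmdbd also re-rolls the usage tables from the earliest affected start time, which a raw UPDATE will not do, so the bogus sreport numbers would likely still need a usage re-rollup afterwards.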