Hello,

I wondered if someone could please help us understand why the
PrologFlags=contain flag is causing jobs to fail and compute nodes to be
drained. We are running Slurm 18.08.0. Has anyone else seen this behaviour?

I'm currently experimenting with PrologFlags=contain. I've found that adding
this flag to slurm.conf radically changes the behaviour of jobs on the compute
nodes.
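
For reference, the only change between the two cases below is toggling this
single line in slurm.conf (we use task/cgroup, as the log extracts show;
everything else in the config stays the same):

#PrologFlags=contain     <- working case (flag commented out)
PrologFlags=contain      <- failing case (flag active)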

When PrologFlags=contain is commented out in slurm.conf, jobs are assigned to
the compute node and start and execute as expected. Here is the relevant
extract from the slurmd log on that node:

[2018-12-12T09:51:40.748] _run_prolog: run job script took usec=4
[2018-12-12T09:51:40.748] _run_prolog: prolog with lock for job 243317 ran for 0 seconds
[2018-12-12T09:51:40.748] Launching batch job 243317 for UID 57337
[2018-12-12T09:51:40.762] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317: alloc=0MB mem.limit=193080MB memsw.limit=unlimited
[2018-12-12T09:51:40.763] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317/step_batch: alloc=0MB mem.limit=193080MB memsw.limit=unlimited

When PrologFlags=contain is enabled, I see the following:

-- I don't see the "_run_prolog" or "task/cgroup" messages in the slurmd logs.
-- The job prolog fails, the job fails, and the job output file is owned by root.
-- The compute node is drained.

sinfo -lN | grep red017
red017  1  batch*  drained  40  2:20:1  190000  0  1  (null)  batch job complete f
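
In case it matters, we are clearing the drain state manually each time with
something like:

scontrol update NodeName=red017 State=RESUME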

Here is the corresponding extract from the slurmd log:

[2018-12-12T09:56:54.564] error: Waiting for JobId=243321 prolog has failed, giving up after 50 sec
[2018-12-12T09:56:54.565] Could not launch job 243321 and not able to requeue it, cancelling job

Best regards,

David
