Hello, I wondered if someone could please help us understand why the PrologFlags=contain flag is causing jobs to fail and draining compute nodes. We are, by the way, using Slurm 18.08.0. Has anyone else seen this behaviour?
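For reference, the only difference between the two behaviours described below is whether this one flag is set in slurm.conf. Roughly (the Prolog path shown here is illustrative, not our actual script location):

  # slurm.conf excerpt
  Prolog=/etc/slurm/prolog.sh   # site prolog script (path illustrative)
  PrologFlags=contain           # commenting this line out restores normal behaviour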
I'm currently experimenting with PrologFlags=contain and have found that adding this flag to slurm.conf radically changes the behaviour of jobs on the compute nodes.

When PrologFlags=contain is commented out in slurm.conf, jobs are assigned to the compute node and start and execute as expected. Here is the relevant extract from the slurmd log on that node:

  [2018-12-12T09:51:40.748] _run_prolog: run job script took usec=4
  [2018-12-12T09:51:40.748] _run_prolog: prolog with lock for job 243317 ran for 0 seconds
  [2018-12-12T09:51:40.748] Launching batch job 243317 for UID 57337
  [2018-12-12T09:51:40.762] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317: alloc=0MB mem.limit=193080MB memsw.limit=unlimited
  [2018-12-12T09:51:40.763] [243317.batch] task/cgroup: /slurm/uid_57337/job_243317/step_batch: alloc=0MB mem.limit=193080MB memsw.limit=unlimited

When PrologFlags=contain is activated I find the following:

  -- I don't see the "_run_prolog" and "task/cgroup" messages in the slurmd log
  -- The job prolog fails, the job fails, and the job output is owned by root
  -- The compute node is drained:

  $ sinfo -lN | grep red017
  red017    1    batch*    drained   40    2:20:1   190000    0    1   (null)   batch job complete f

Here is the corresponding extract from the slurmd log:

  [2018-12-12T09:56:54.564] error: Waiting for JobId=243321 prolog has failed, giving up after 50 sec
  [2018-12-12T09:56:54.565] Could not launch job 243321 and not able to requeue it, cancelling job

Best regards,
David
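P.S. In case it matters for reproducing this: between tests I have simply been returning the drained node to service with the usual

  scontrol update NodeName=red017 State=RESUME

and the node drains again the next time a job with the contain flag active lands on it.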