Hi,

We have a running Slurm cluster and users have been submitting jobs for
the past 3 months without any issues. Recently, some nodes are randomly
being drained with the reason "prolog error".
Our slurm.conf has these 2 lines regarding prolog:
PrologFlags=Contain,Alloc,X11
Prolog=/slurm_stuff/bin/prolog.d/prolog*
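
(In case it matters, I am thinking of raising the slurmd log level on the
affected nodes to capture more of the prolog output; something like the
following, if I understand the options correctly. The log path is just the
one we use locally:)

SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd.log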

Inside the prolog.d folder there are two scripts, which run with no errors
as far as I can tell. Is there a way to debug why the nodes occasionally go
into the draining state because of a "prolog error"? It seems to happen at
random times and on random nodes.
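
One idea I had, which I have not deployed yet, is to point Prolog= at a
single wrapper instead of the glob, so that each script's exit code and
output get logged per node. A sketch only; the log path is made up:

#!/bin/bash
# Hypothetical wrapper (sketch only, paths assumed): run each prolog.d
# script in order, log its exit code and output, and return the first
# non-zero code to slurmd so the current behaviour is preserved.
LOG=/var/log/slurm/prolog_debug.log
for script in /slurm_stuff/bin/prolog.d/prolog*; do
    out=$("$script" 2>&1)
    rc=$?
    echo "$(date '+%F %T') job=${SLURM_JOB_ID:-?} ${script} rc=${rc} out=${out}" >> "$LOG"
    if [ "$rc" -ne 0 ]; then
        exit "$rc"
    fi
done
exit 0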

From the log file, I can see only this:

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: prolog failed: rc:230 output:Successfully started proces>

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: [job 20398] prolog failed status=230:0

Oct 06 00:57:43 pgpu008 slurmd[3709622]: slurmd: Job 20398 already killed, do not launch batch job

Oct 06 13:06:23 pgpu008 systemd[1]: Stopping Slurm node daemon...

Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Caught SIGTERM. Shutting down.

Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Slurmd shutdown completing
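
For reference, this is roughly how I have been checking the recorded reason
after a node drains (node name from the example above; the slurmd log path
may differ on your systems):

# drain reason recorded for the node
scontrol show node pgpu008 | grep -i reason
# all drained/down nodes with their reasons and timestamps
sinfo -R
# prolog-related lines in the node's slurmd log
grep -i prolog /var/log/slurm/slurmd.log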


Currently, job 20398 (the one being killed in the log above) is in the
state "launch failed requeued held" after I resume the node.
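
Once the node is resumed, is scontrol release the right way to clear the
requeued-held state on the job? I.e. (job and node names from the example
above):

# bring the drained node back into service
scontrol update NodeName=pgpu008 State=RESUME
# release the requeued-held job so it can be scheduled again
scontrol release 20398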

Fritz Ratnasamy
Data Scientist
Information Technology