Hi,

We have a running Slurm cluster and users have been submitting jobs for the past 3 months without any issues. Recently, some nodes are randomly being drained with the reason "prolog error". Our slurm.conf has these two lines regarding the prolog:

PrologFlags=Contain,Alloc,X11
Prolog=/slurm_stuff/bin/prolog.d/prolog*
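One thought I had for debugging this (just a sketch; the wrapper path and log location below are placeholders, nothing we actually have in place) would be to temporarily point Prolog= at a single wrapper script that runs each prolog.d script itself and records the exit code and output:

    #!/bin/bash
    # Hypothetical wrapper, e.g. /slurm_stuff/bin/prolog_wrapper.sh (path assumed).
    # Runs each existing prolog.d script and logs its output and exit code, so the
    # script producing the non-zero rc can be identified. Log location is assumed.
    LOG=/var/log/slurm/prolog_debug.log
    for script in /slurm_stuff/bin/prolog.d/prolog*; do
        out=$("$script" 2>&1)
        rc=$?
        echo "$(date '+%F %T') job=${SLURM_JOB_ID:-unknown} node=$(hostname -s) script=$script rc=$rc output=$out" >> "$LOG"
        if [ "$rc" -ne 0 ]; then
            exit "$rc"   # preserve the failing exit code so slurmd still treats it as a prolog failure
        fi
    done
    exit 0

That way, the next time a node drains, the wrapper log should show which script returned the failing code and what it printed.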
Inside the prolog.d folder, there are two scripts which run with no errors as far as I can see. Is there a way to debug why the nodes occasionally go into draining mode because of "prolog error"? It seems to happen at random times and on random nodes. From the log file, I can see only this:

Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: prolog failed: rc:230 output:Successfully started proces>
Oct 06 00:57:43 pgpu008.chicagobooth.edu slurmd[3709622]: slurmd: error: [job 20398] prolog failed status=230:0
Oct 06 00:57:43 pgpu008 slurmd[3709622]: slurmd: Job 20398 already killed, do not launch batch job
Oct 06 13:06:23 pgpu008 systemd[1]: Stopping Slurm node daemon...
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Caught SIGTERM. Shutting down.
Oct 06 13:06:23 pgpu008 slurmd[3709622]: slurmd: Slurmd shutdown completing

Currently, job 20398, the one being killed in the log above, is in the state "Launch failed requeue held" after I resumed the node.

Fritz Ratnasamy
Data Scientist
Information Technology
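P.S. Would the right way to recover be to simply release the held job and resume the node, i.e. something along these lines (job id and node name taken from the log above; this is only my guess)?

    scontrol release 20398                      # release the "Launch failed requeue held" job
    scontrol update NodeName=pgpu008 State=RESUME
    sinfo -R                                    # check whether any nodes are still drained

And would raising SlurmdDebug on the affected nodes (for example SlurmdDebug=debug2) be a reasonable way to get more detail about the prolog failure in the slurmd log?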
