We installed Slurm 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and we periodically see a single node (perhaps the first node in an allocation?) get drained with the reason "batch job complete failure".

On one node in question, slurmd.log reports

   pam_unix(slurm:session): open_session - error recovering username
   pam_loginuid(slurm:session): unexpected response from failed conversation function
On another node drained for the same reason, slurmd.log shows

   error: pam_open_session: Cannot make/remove an entry for the specified session
   error: error in pam_setup
   error: job_manager exiting abnormally, rc = 4020
   sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

   error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job

   drain_nodes: node Summer0c048 state set to DRAIN
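
In case it points anyone at the answer faster, here's where I'm planning to start tomorrow. The failing service in those messages is "slurm", so the stack slurmd is using should be /etc/pam.d/slurm (or whatever that file includes); the UsePAM check below assumes it's the UsePAM setting in slurm.conf that pulls PAM into slurmd in the first place, which I still need to confirm on our install:

   # Is slurmd configured to run the PAM session stack at all?
   scontrol show config | grep -i usepam

   # What is in the "slurm" PAM service, and where does pam_loginuid.so get pulled in?
   cat /etc/pam.d/slurm
   grep -rn pam_loginuid /etc/pam.d/

If pam_loginuid turns out to be the culprit, I gather the usual workaround is to mark that line "optional" (or drop it) in the stack slurmd uses, but I'd welcome a better explanation of why only one node per job trips over it.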

It's been a long day (for other reasons), so I'll go dig into this tomorrow. But if anyone can shine some light on where I should start looking, I shall be most obliged!

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
