We installed Slurm 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and we periodically see a single node (perhaps the first node in an allocation?) get drained with the reason "batch job complete failure".

On one node in question, slurmd.log reports

   pam_unix(slurm:session): open_session - error recovering username
   pam_loginuid(slurm:session): unexpected response from failed conversation function
On another node drained for the same reason, slurmd.log shows

   error: pam_open_session: Cannot make/remove an entry for the specified session
   error: error in pam_setup
   error: job_manager exiting abnormally, rc = 4020
   sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

   error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job

   drain_nodes: node Summer0c048 state set to DRAIN
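
In case it points anyone at the answer faster, here's where I'm planning to start tomorrow. The failing service in those messages is "slurm", so the stack slurmd is using should be /etc/pam.d/slurm (or whatever that file includes); the UsePAM check below assumes it's the UsePAM setting in slurm.conf that pulls PAM into slurmd in the first place, which I still need to confirm on our install:

   # Is slurmd configured to run the PAM session stack at all?
   scontrol show config | grep -i usepam

   # What is in the "slurm" PAM service, and where does pam_loginuid.so get pulled in?
   cat /etc/pam.d/slurm
   grep -rn pam_loginuid /etc/pam.d/

If pam_loginuid turns out to be the culprit, I gather the usual workaround is to mark that line "optional" (or drop it) in the stack slurmd uses, but I'd welcome a better explanation of why only one node per job trips over it.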

It's been a long day (for other reasons), so I'll go dig into this tomorrow. But if anyone can shine some light on where I should start looking, I shall be most obliged!

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
