[slurm-users] Cloud nodes remain in state "alloc#"

2020-10-24 Thread Rupert Madden-Abbott
Hi, I'm using Slurm's elastic compute functionality to spin up nodes in the cloud, alongside a controller which is also in the cloud. When executing a job, Slurm correctly places a node into the state "alloc#" and calls my resume program. My resume program successfully provisions the cloud node a…
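For context, elastic cloud scaling of the kind described above is driven by a handful of slurm.conf settings; the "alloc#" state means the node is allocated and powering up, and it clears once slurmd on the new node registers with the controller. A minimal sketch, with hypothetical node names, script paths, and sizes:

```
# slurm.conf — minimal cloud-scaling sketch (names and values illustrative)
ResumeProgram=/usr/local/sbin/resume-node.sh    # invoked when a node enters alloc#
SuspendProgram=/usr/local/sbin/suspend-node.sh  # invoked when an idle node is powered down
ResumeTimeout=600    # seconds Slurm waits for the booted node's slurmd to register
SuspendTime=300      # idle seconds before a cloud node is suspended
NodeName=cloud[01-04] State=CLOUD CPUs=8 RealMemory=16000
PartitionName=cloud Nodes=cloud[01-04] State=UP
```

If the node's slurmd never registers under a matching NodeName before ResumeTimeout expires, the node stays in "alloc#" and is eventually marked down, so hostname/NodeName agreement between the cloud instance and slurm.conf is one of the first things to check.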

Re: [slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Chris Samuel
On 10/24/20 9:22 am, Kimera Rodgers wrote: [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i srun: error: slurm_receive_msgs: Socket timed out on send/recv operation srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation srun: error: Application…

Re: [slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Paul Edmon
This can happen if the underlying storage is wedged.  I would check that it is working properly. Usually the only way to clear this state is either to fix the stuck storage or reboot the node. -Paul Edmon- On 10/24/2020 12:22 PM, Kimera Rodgers wrote: I'm setting up Slurm on an OpenHPC cluster with…
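As a sketch of the kind of check-then-reboot procedure Paul describes (the node name is taken from the error output above; everything else is illustrative), an admin might do something like:

```
# Processes stuck in uninterruptible 'D' state often indicate wedged storage:
ps axo pid,stat,wchan,cmd | awk '$2 ~ /D/'

# Inspect the node and any jobs stuck in completing from the controller:
scontrol show node c-node3
squeue -t CG -o "%i %T %N"

# If the storage cannot be recovered, drain the node, reboot it,
# then return it to service:
scontrol update NodeName=c-node3 State=DRAIN Reason="wedged storage"
# ...reboot the node, then:
scontrol update NodeName=c-node3 State=RESUME
```

These commands assume a working slurmctld; the drain/resume cycle is the standard way to take a misbehaving node out of scheduling without killing jobs elsewhere.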

[slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Kimera Rodgers
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute nodes. When I run test jobs, the jobs get stuck in the CG state. Can someone give me a hint on where I might have gone wrong? [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i srun: error: slurm_receive_msgs…

Re: [slurm-users] pam_slurm_adopt always claims no active jobs even when they do

2020-10-24 Thread Juergen Salk
Hi Paul, maybe this is totally unrelated, but we also have a similar issue with pam_slurm_adopt when ConstrainRAMSpace=no is set in cgroup.conf and more than one job is running on that node. There is a bug report open at: https://bugs.schedmd.com/show_bug.cgi?id=9355 As a workaround we…
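For reference, the setting Juergen mentions lives in cgroup.conf on the compute nodes; a minimal sketch (values illustrative, not a recommendation from the thread):

```
# cgroup.conf — sketch of the constraint settings discussed above
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes   # the thread reports pam_slurm_adopt problems when this
                        # is 'no' and multiple jobs share a node (see bug 9355)
```

pam_slurm_adopt relies on the job's cgroup hierarchy to pick which job to adopt an incoming SSH session into, which is why the cgroup constraint settings interact with it.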