Re: [slurm-users] Jobs stuck in "completing" (CG) state

Paul Edmon Sat, 24 Oct 2020 10:14:07 -0700

This can happen if the underlying storage is wedged. I would check that it is working properly.

Usually the only way to clear this state is to either fix the stuck storage or reboot the node.
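
As a rough sketch of what that check could look like on an affected compute node (the helper names and the /home mount point below are assumptions for illustration, not something from the original report):

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the path answers a stat call
# within 5 seconds. A hung network mount usually makes stat block,
# in which case timeout gives up and the function returns non-zero.
check_mount() {
    timeout 5 stat "$1" >/dev/null 2>&1
}

# Hypothetical helper: print the ps header plus any process in
# uninterruptible sleep (state D), a common sign of stuck I/O.
list_d_state() {
    ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
}

# Example use on the compute node (/home is an assumed shared mount):
#   check_mount /home || echo "/home is not responding"
#   list_d_state
#
# After the storage is fixed or the node is rebooted, a commonly used
# way to release the node from the head node is to mark it down and
# then resume it:
#   scontrol update nodename=c-node3 state=down reason="stuck CG jobs"
#   scontrol update nodename=c-node3 state=resume
```

If check_mount fails or list_d_state shows processes belonging to the job, that points to the storage rather than to the Slurm configuration.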


-Paul Edmon-

On 10/24/2020 12:22 PM, Kimera Rodgers wrote:

I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute nodes.
When I run test jobs, the jobs get stuck in the CG state.

Can someone give me a hint as to where I might have gone wrong?

[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

[root@kla-ac-ohpc-01 critical]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                36    normal     bash     test CG       0:53      2 c-node[1-2]
                37    normal     bash     root CG       0:52      1 c-node3

Thank you.

Regards,
Rodgers
