Hi,
I'm using Slurm's elastic compute functionality to spin up nodes in the
cloud, alongside a controller which is also in the cloud.
When executing a job, Slurm correctly places a node into the state "alloc#"
and calls my resume program. My resume program successfully provisions the
cloud node a
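For reference, the cloud provisioning hooks in slurm.conf look roughly
like this (a minimal sketch, not my exact config; the program paths,
node names and sizes are placeholders):

ResumeProgram=/usr/local/sbin/slurm_resume.sh    # called when a cloud node goes to alloc#
SuspendProgram=/usr/local/sbin/slurm_suspend.sh  # called to power the node back down
ResumeTimeout=600    # seconds slurmctld waits for the new node to register
SuspendTime=300      # idle seconds before a cloud node is powered down again
NodeName=cloud[001-010] State=CLOUD CPUs=8 RealMemory=16000
PartitionName=cloud Nodes=cloud[001-010] MaxTime=INFINITE State=UP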
On 10/24/20 9:22 am, Kimera Rodgers wrote:
[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed
out on send/recv operation
srun: error: Application
This can happen if the underlying storage is wedged. I would check that
it is working properly.
Usually the only way to clear this state is either to fix the stuck
storage or to reboot the node.
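For example, something along these lines (a rough sketch; c-node3 is
taken from the error above and /shared is just a placeholder for your
shared filesystem):

# Look for processes stuck in uninterruptible sleep ("D" state), which
# usually points at hung I/O:
ssh c-node3 "ps -eo pid,stat,wchan:32,cmd | awk '\$2 ~ /D/'"
# Check whether the shared filesystem responds at all:
ssh c-node3 "timeout 10 ls /shared || echo 'mount appears hung'"
# If the storage cannot be recovered, drain the node and reboot it:
scontrol update NodeName=c-node3 State=DRAIN Reason="hung storage"
scontrol reboot c-node3   # needs RebootProgram in slurm.conf, otherwise reboot by hand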
-Paul Edmon-
On 10/24/2020 12:22 PM, Kimera Rodgers wrote:
I'm setting up Slurm on an OpenHPC cluster with one master node and 5
compute nodes.
When I run test jobs, the jobs get completely stuck in the CG
(completing) state.
Can someone give me a hint on where I might have gone wrong?
[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
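(A quick way to see which jobs and nodes are affected; these are
standard Slurm commands, with the node name taken from the error above:)

squeue --states=CG -o "%.10i %.9P %.20j %.8u %.10M %R"   # jobs stuck in completing
scontrol show node c-node3 | grep -E "State|Reason"      # node state and drain reason
sinfo -R                                                 # drained/down nodes with reasons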
Hi Paul,
Maybe this is totally unrelated, but we also have a similar issue with
pam_slurm_adopt when ConstrainRAMSpace=no is set in cgroup.conf and
more than one job is running on that node. There is a
bug report open at:
https://bugs.schedmd.com/show_bug.cgi?id=9355
As a workaround we
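For anyone trying to reproduce it, the combination in question looks
roughly like this in cgroup.conf (a sketch of the relevant lines only,
not our full configuration):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no   # with this set to "no", the pam_slurm_adopt issue described
                       # above shows up once more than one job runs on the node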