Hi all,

We have one cluster running Slurm 20.11.8 on CentOS 8.2. It has suddenly developed a weird problem: *only pending jobs* get cancelled, with a "Transport endpoint is not connected" error (see image https://user-images.githubusercontent.com/19144683/229037078-ca704ba8-23a4-4948-9d1a-bacab82acd1f.png). All jobs are submitted with the srun command.

    ...
    ...
    srun: job 6367724 queued and waiting for resources
    srun: error: Unable to allocate resources: Transport endpoint is not connected
    srun: job 6367725 queued and waiting for resources
    srun: error: Unable to allocate resources: Transport endpoint is not connected
    srun: job 6367726 queued and waiting for resources
    srun: job 6367727 queued and waiting for resources
    srun: job 6367728 queued and waiting for resources
    srun: error: Unable to allocate resources: Transport endpoint is not connected
    srun: Force Terminated job 6366908

On the controller, slurmctld logs these errors:

    [root@slurm-master01 bin]# journalctl --since today -p err _COMM=slurmctld
    Mar 31 02:50:46 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected
    Mar 31 02:50:47 slurm-master01 slurmctld[220654]: error: slurm_receive_msgs: Transport endpoint is not connected

According to https://github.com/SchedMD/slurm/blob/slurm-20-11-8-1/src/srun/libsrun/allocate.c#L182-L227, it seems to be an OS issue? I've googled "transport endpoint is not connected", and many references attribute it to a filesystem I/O issue. So:

* How can we prevent pending jobs from being cancelled by Slurm?
* What caused the error reported by slurmctld?

Thanks!
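
For reference, on Linux the text "Transport endpoint is not connected" is the standard message for the errno value ENOTCONN, which socket calls can return as well as filesystem calls. The minimal sketch below only illustrates that; it is not taken from Slurm, and the function name `enotconn_from_unconnected_socket` is made up for this example. It assumes a Linux host with Python 3 and does not show what slurmctld or srun actually do internally.

```python
import errno
import os
import socket


def enotconn_from_unconnected_socket() -> int:
    """Return the errno raised by getpeername() on a never-connected TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.getpeername()  # the socket has no peer, so this call fails
        except OSError as exc:
            return exc.errno
    return 0  # not expected: getpeername() should have raised


if __name__ == "__main__":
    code = enotconn_from_unconnected_socket()
    # Compare the errno with ENOTCONN and print the OS message for it.
    print(code == errno.ENOTCONN, os.strerror(code))
```

If the message on the cluster comes from a socket in this state, the cause may be on the network or RPC side (for example, a connection between srun and slurmctld that was dropped) rather than on a filesystem, but that is only a guess from the error text.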