Colas,

We had a similar experience a long time ago, and we solved it by adding the following SchedulerParameters:
max_rpc_cnt=150,defer

(A rough slurm.conf sketch follows the quoted message below.)

HTH,
John DeSantis

On Thu, 11 Jan 2018 16:39:43 -0500
Colas Rivière <rivi...@umdgrb.umd.edu> wrote:

> Hello,
>
> I'm managing a small cluster (one head node, 24 workers, 1160 total
> worker threads). The head node has two E5-2680 v3 CPUs
> (hyper-threaded), ~100 GB of memory, and spinning disks.
> The head node occasionally becomes less responsive when there are
> more than 10k jobs in queue, and becomes really unmanageable when
> reaching 100k jobs in queue, with error messages such as:
>
> > sbatch: error: Slurm temporarily unable to accept job, sleeping and
> > retrying.
>
> or
>
> > Running: slurm_load_jobs error: Socket timed out on send/recv
> > operation
>
> Is it normal to experience slowdowns when the queue reaches only a
> few tens of thousands of jobs? What limit should I expect? Would
> adding an SSD drive for SlurmdSpoolDir help? What can be done to push
> this limit?
>
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
> (from `scontrol show config`).
>
> Thanks,
> Colas
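For reference, in slurm.conf it looks roughly like this; only the SchedulerParameters values are what we actually set, and the SchedulerType line is shown purely for context:

    # Fragment of slurm.conf (illustrative sketch)
    # Only SchedulerParameters reflects the change we made;
    # sched/backfill is simply the usual default scheduler type.
    SchedulerType=sched/backfill
    SchedulerParameters=max_rpc_cnt=150,defer

Both options throttle scheduler work when slurmctld is busy: max_rpc_cnt tells the controller to defer scheduling passes while that many RPC threads are active, and defer avoids trying to schedule each job individually at submit time. A `scontrol reconfigure` picks up the change after editing slurm.conf.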