Re: [slurm-users] Queue size, slow/unresponsive head node

Nicholas C Santucci Thu, 11 Jan 2018 21:27:38 -0800

Why do you have this setting? Is it even allowed?

SchedulerParameters     = (null)


https://slurm.schedmd.com/sched_config.html
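
For illustration, here is a minimal sketch of what a non-empty setting could look like in slurm.conf. The option names are ones documented for SchedulerParameters, but the values are placeholders I am assuming for the example, not recommendations tuned for this cluster; check the slurm.conf man page for your version (17.02 here) before using any of them.

```
# slurm.conf fragment (illustrative values only, not tuned for this cluster)
# defer               - do not try to schedule each job individually at submit time
# default_queue_depth - number of jobs the main scheduler considers per pass
# bf_max_job_test     - maximum number of jobs the backfill scheduler examines
# max_rpc_cnt         - hold off scheduling while this many RPC threads are active
SchedulerParameters=defer,default_queue_depth=100,bf_max_job_test=1000,max_rpc_cnt=150
```

After editing slurm.conf, `scontrol reconfigure` should pick up the change, and `scontrol show config | grep SchedulerParameters` should then show the new value instead of (null).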

On Thu, Jan 11, 2018 at 1:39 PM, Colas Rivière <rivi...@umdgrb.umd.edu>
wrote:

> Hello,
>
> I'm managing a small cluster (one head node, 24 workers, 1160 total worker
> threads). The head node has two E5-2680 v3 CPUs (hyper-threaded), ~100 GB
> of memory and spinning disks.
> The head node occasionally becomes less responsive when there are more
> than 10k jobs in the queue, and becomes really unmanageable when reaching
> 100k jobs in the queue, with error messages such as:
>
>> sbatch: error: Slurm temporarily unable to accept job, sleeping and
>> retrying.
>>
> or
>
>> Running: slurm_load_jobs error: Socket timed out on send/recv operation
>>
> Is it normal to experience slowdowns when the queue reaches a few tens of
> thousands of jobs? What limit should I expect? Would adding an SSD for
> SlurmdSpoolDir help? What can be done to push this limit?
>
> The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
> (from `scontrol show config`).
>
> Thanks,
> Colas
>



-- 
Nick Santucci
santu...@uci.edu
