On 17/04/2019 18.54, Yang Liu wrote:
> We often receive errors due to socket timeouts on send/recv operations:
>
> slurm_load_jobs error: Socket timed out on send/recv operation
> slurm_load_node: Socket timed out on send/recv operation
>
> What could cause the errors? How likely is it that job_submit.lua causes
> such errors? We have a program that runs every 2 seconds to collect
> information on pending jobs. Does that program cause the errors?

Maybe the Slurm controller is overloaded; in that case, every bit of load you can take off it helps. However, even if the controller isn't generally overloaded, occasional spikes can still cause these kinds of issues.

We used to suffer from these errors as well. In our case it was enough to bump the kernel settings somaxconn and tcp_max_syn_backlog (we use 4096 for both). See also https://slurm.schedmd.com/high_throughput.html

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
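
For anyone wanting to check the two kernel settings mentioned in the reply above, here is a minimal sketch. It assumes a Linux host where the values are exposed under /proc/sys (net/core/somaxconn and net/ipv4/tcp_max_syn_backlog). The function name check_backlog and the default target of 4096 are illustrative only; 4096 is simply the value quoted in the reply, not a general recommendation.

```python
from pathlib import Path

# Value quoted in the reply above; an assumption, not an official recommendation.
TARGET = 4096

# Standard Linux procfs locations of the two settings, relative to /proc/sys.
SETTINGS = {
    "net.core.somaxconn": "net/core/somaxconn",
    "net.ipv4.tcp_max_syn_backlog": "net/ipv4/tcp_max_syn_backlog",
}


def check_backlog(proc_sys="/proc/sys", target=TARGET):
    """Return {setting: (current_value, meets_target)} for each readable setting."""
    results = {}
    for name, rel in SETTINGS.items():
        path = Path(proc_sys) / rel
        try:
            value = int(path.read_text().split()[0])
        except (OSError, ValueError, IndexError):
            continue  # not Linux, or the file is missing or unreadable
        results[name] = (value, value >= target)
    return results


if __name__ == "__main__":
    for name, (value, ok) in check_backlog().items():
        status = "ok" if ok else f"below {TARGET}"
        print(f"{name} = {value} ({status})")
```

If a value turns out to be low, it can be raised as root with, for example, `sysctl -w net.core.somaxconn=4096`, and made persistent through /etc/sysctl.conf or a file under /etc/sysctl.d/.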