Hi John,

thanks for the info. slurmctld doesn't report anything about the server thread count in the logs, and sdiag shows only 3 server threads.
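In case it is useful for keeping an eye on this, here is a minimal sketch of how the thread count could be pulled out of the sdiag output with Python. It assumes sdiag prints a line of the form "Server thread count: N" and that sdiag is on the PATH of the host where it runs. The helper names read_sdiag and server_thread_count are made up for illustration and are not part of SLURM.

```python
import re
import subprocess
from typing import Optional


def read_sdiag() -> str:
    """Run sdiag and return its text output (assumes sdiag is on PATH)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout


def server_thread_count(sdiag_output: str) -> Optional[int]:
    """Return the number on the 'Server thread count' line, or None if absent."""
    # Assumption: the line looks like "Server thread count: 3".
    match = re.search(r"^\s*Server thread count:\s*(\d+)", sdiag_output, re.MULTILINE)
    return int(match.group(1)) if match else None
```

On the controller this would be called as server_thread_count(read_sdiag()), for example from a cron job that records the value while the timeouts occur.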
We changed the MessageTimeout value to 20. I'll let you know if it solves the problem.

Thanks
ale

----- Original Message -----
> From: "John DeSantis" <desan...@usf.edu>
> To: "Alessandro Federico" <a.feder...@cineca.it>
> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli" <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
> Sent: Friday, January 12, 2018 7:58:38 PM
> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
>
> Ciao Alessandro,
>
> > Do we have to apply any particular setting to avoid incurring the
> > problem?
>
> What is your "MessageTimeout" value in slurm.conf? If it's at the
> default of 10, try changing it to 20.
>
> I'd also check and see if the slurmctld log is reporting anything
> pertaining to the server thread count being over its limit.
>
> HTH,
> John DeSantis
>
> On Fri, 12 Jan 2018 11:32:57 +0100
> Alessandro Federico <a.feder...@cineca.it> wrote:
>
> > Hi all,
> >
> > we are setting up SLURM 17.11.2 on a small test cluster of about 100
> > nodes. Sometimes we get the error in the subject when running any
> > SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
> >
> > Do we have to apply any particular setting to avoid incurring the
> > problem?
> >
> > We found this bug report
> > https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the
> > previous SLURM version and we do not set debug3 on slurmctld.
> >
> > thanks in advance
> > ale

--
Alessandro Federico
HPC System Management Group
System & Technology Department
CINECA www.cineca.it
Via dei Tizii 6, 00185 Rome - Italy
phone: +39 06 44486708