Alessandro,

You might want to consider tracking your Slurm scheduler diagnostics (sdiag) output with a time-series monitoring system. The time-based history has at times proven more helpful than the log contents by themselves.
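To give a rough idea of the collection side, here is a minimal Python sketch. It is not taken from Giovanni's post. It assumes that sdiag prints its counters as "Label: number" lines and that a Graphite carbon listener accepts the plaintext protocol ("path value timestamp") on port 2003. The function names, the "slurm.sdiag" prefix and the host "graphite.example.org" are made up for illustration.

```python
import re
import socket
import subprocess
import time

# Matches lines such as "Server thread count: 3" (assumed sdiag layout).
LINE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 _()/-]*):\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_sdiag(text):
    """Return {metric_name: number} for every 'Label: <number>' line.

    Labels that repeat across sections overwrite each other here;
    a real collector would prefix them with the section name.
    """
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # headers, dates and other free text are skipped
        label, value = match.groups()
        name = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
        metrics[name] = float(value) if "." in value else int(value)
    return metrics


def to_graphite_lines(metrics, prefix, timestamp):
    """Format metrics as Graphite plaintext lines: 'path value timestamp'."""
    return [
        f"{prefix}.{name} {value} {timestamp}"
        for name, value in sorted(metrics.items())
    ]


def send_to_graphite(lines, host, port=2003):
    """Send the lines to a carbon plaintext listener over TCP."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


def collect_once(host, prefix="slurm.sdiag"):
    """Run sdiag once and push whatever numeric counters it printed."""
    output = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    ).stdout
    lines = to_graphite_lines(parse_sdiag(output), prefix, int(time.time()))
    send_to_graphite(lines, host)
```

Calling collect_once("graphite.example.org") from cron every minute or so would build up the history; the server thread count and agent queue size are probably the series to watch for the timeout problem discussed in this thread.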
See Giovanni Torres' post on setting this up...

http://giovannitorres.me/graphing-sdiag-with-graphite.html

--
Trevor

> On Jan 15, 2018, at 4:33 AM, Alessandro Federico <a.feder...@cineca.it> wrote:
>
> Hi John
>
> thanks for the info.
> slurmctld doesn't report anything about the server thread count in the logs
> and sdiag show only 3 server threads.
>
> We changed the MessageTimeout value to 20.
>
> I'll let you know if it solves the problem.
>
> Thanks
> ale
>
> ----- Original Message -----
>> From: "John DeSantis" <desan...@usf.edu>
>> To: "Alessandro Federico" <a.feder...@cineca.it>
>> Cc: slurm-users@lists.schedmd.com, "Isabella Baccarelli"
>> <i.baccare...@cineca.it>, hpc-sysmgt-i...@cineca.it
>> Sent: Friday, January 12, 2018 7:58:38 PM
>> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv
>> operation
>>
>> Ciao Alessandro,
>>
>>> Do we have to apply any particular setting to avoid incurring the
>>> problem?
>>
>> What is your "MessageTimeout" value in slurm.conf? If it's at the
>> default of 10, try changing it to 20.
>>
>> I'd also check and see if the slurmctld log is reporting anything
>> pertaining to the server thread count being over its limit.
>>
>> HTH,
>> John DeSantis
>>
>> On Fri, 12 Jan 2018 11:32:57 +0100
>> Alessandro Federico <a.feder...@cineca.it> wrote:
>>
>>> Hi all,
>>>
>>>
>>> we are setting up SLURM 17.11.2 on a small test cluster of about
>>> 100 nodes. Sometimes we get the error in the subject when running any
>>> SLURM command (e.g. sinfo, squeue, scontrol reconf, etc...)
>>>
>>>
>>> Do we have to apply any particular setting to avoid incurring the
>>> problem?
>>>
>>>
>>> We found this bug report
>>> https://bugs.schedmd.com/show_bug.cgi?id=4002 but it regards the
>>> previous SLURM version and we do not set debug3 on slurmctld.
>>>
>>>
>>> thanks in advance
>>> ale
>>>
>>
>>
>
> --
> Alessandro Federico
> HPC System Management Group
> System & Technology Department
> CINECA www.cineca.it
> Via dei Tizii 6, 00185 Rome - Italy
> phone: +39 06 44486708