Hangups like this can occur when slurmdbd is busy with requests. I've
seen it happen when an ill-timed, massive sacct request hits while
slurmdbd is doing its rollup; slurmctld then hangs for as long as
slurmdbd stays busy. Typically in that case restarting mysql/slurmdbd
fixes the issue.
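If you want to confirm that's what is going on, something along these
lines usually tells the story (just a sketch; the systemd unit names
and the slurmdbd log path are assumptions for a typical install and
may differ on yours):

    # see whether slurmdbd is in the middle of its rollup (log path may differ)
    grep -i "roll" /var/log/slurm/slurmdbd.log | tail

    # if it is wedged, restart the accounting stack
    systemctl restart mysql
    systemctl restart slurmdbd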
Otherwise this can happen due to heavy traffic to the slurmctld. You
can try adding the defer option to SchedulerParameters; that throttles
the scheduler so it can handle the additional load.
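Concretely, that just means adding defer to the SchedulerParameters
line in slurm.conf (a sketch; keep whatever other options you already
have on that line) and pushing it out with scontrol:

    # slurm.conf -- append defer to any existing SchedulerParameters values
    SchedulerParameters=defer

    # apply the change to the running slurmctld
    scontrol reconfigure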
-Paul Edmon-
On 11/8/2017 3:11 PM, Sean Caron wrote:
Hi all,
I see SLURM 17.02.9 slurmctld hang or become unresponsive every few
days with the message in syslog:
server_thread_count over limit (256), waiting
From the user perspective, I believe they see "Socket timed out on
send/recv operation". Slurmctld never seems to recover once it's in
this state and will not respond to /etc/init.d/slurm restart. Only
after an admin does a kill -9 and restarts slurmctld does it snap back.
I don't see anything else in the logs that looks like an error message
that would help diagnose what is going on, even with log level debug3
on the SLURM controller daemon.
I monitor CPU and memory utilization with "htop" on the machine
running the controller daemon, and it doesn't seem to be overwhelmed
by slurmctld load or anything like that.
The machine running the controller daemon seems reasonable for the
task, given the size of our cluster: it's a repurposed Dell PowerEdge
R410 with 24 threads and 32 GB of physical memory. Unless I'm way off?
I tried all kinds of SchedulerParameters tweaks on sched/backfill and
even set the scheduler back to sched/builtin, and it's still happening.
It didn't seem to affect the frequency much, either.
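For reference, the scheduler settings I'm referring to are the ones
the running controller reports, e.g.:

    # dump the scheduler-related settings from the running slurmctld
    scontrol show config | grep -E "SchedulerType|SchedulerParameters"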
Any thoughts what could be causing SLURM to spawn so many threads and
hang up?
Our cluster is medium-sized, we probably have a few thousand jobs in
the queue on average at any given time.
Monitoring with sdiag, the max cycle time of the main scheduler never
cracks 2 seconds. That seems reasonable?
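That is, pulling the scheduler statistics straight out of sdiag,
roughly:

    # cycle times in the "Main schedule statistics" section are in microseconds
    sdiag | grep -i cycle

    # this counter should be the same one the "over limit (256)" message refers to
    sdiag | grep -i "server thread"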
Thanks,
Sean