Thanks, Paul. I've toggled SchedulerParameters=defer,... in and out of the configuration per suggestions in various SLURM bug tracker threads, but that was probably while we were still focused on getting sched/backfill playing ball. I'll try it again now that we're back to sched/builtin and see if that helps. If we still see issues, I'll look at a potential slurmdbd performance bottleneck and try some of the MySQL tuning suggestions I've seen in the docs.
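
For reference, the knobs I'm planning to test look roughly like this (defer may end up combined with other SchedulerParameters options, and the MySQL values are just the starting points I've seen suggested in the accounting docs, not tuned for our hardware):

    # slurm.conf
    SchedulerType=sched/builtin
    SchedulerParameters=defer

    # my.cnf on the slurmdbd host, [mysqld] section
    innodb_buffer_pool_size=1024M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900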
Best,

Sean

On Wed, Nov 8, 2017 at 3:57 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote:

> So hangups like this can occur due to the slurmdbd being busy with
> requests. I've seen that happen when an ill-timed massive sacct request
> hits while slurmdbd is doing its roll-up. In that case the slurmctld hangs
> while slurmdbd is busy. Typically restarting mysql/slurmdbd seems to fix
> the issue.
>
> Otherwise this can happen due to massive traffic to the slurmctld. You
> can try using the defer option for SchedulerParameters. That slows down
> the scheduler so it can handle the additional load.
>
> -Paul Edmon-
>
> On 11/8/2017 3:11 PM, Sean Caron wrote:
>
>> Hi all,
>>
>> I see SLURM 17.02.9 slurmctld hang or become unresponsive every few days
>> with this message in syslog:
>>
>> server_thread_count over limit (256), waiting
>>
>> I believe from the user perspective they see "Socket timed out on
>> send/recv operation". Slurmctld never seems to recover once it's in this
>> state and will not respond to /etc/init.d/slurm restart. Only after an
>> admin does a kill -9 and restarts slurmctld does it snap back.
>>
>> I don't see anything else in the logs that looks like an error message
>> that would help diagnose what is going on, even with log level debug3 on
>> the SLURM controller daemon.
>>
>> I monitor CPU and memory utilization with "htop" on the machine running
>> the controller daemon, and it doesn't seem to be overwhelmed by slurmctld
>> load or anything like that.
>>
>> The machine running the controller daemon seems reasonable for the task,
>> given the size of our cluster: a repurposed Dell PowerEdge R410 with 24
>> threads and 32 GB of physical memory. Unless I'm way off?
>>
>> I tried all kinds of SchedulerParameters tweaks on sched/backfill and
>> even set the scheduler back to sched/builtin, and it's still happening.
>> It didn't seem to affect the frequency much, either.
>>
>> Any thoughts on what could be causing SLURM to spawn so many threads and
>> hang up?
>>
>> Our cluster is medium-sized; we probably have a few thousand jobs in the
>> queue on average at any given time.
>>
>> Monitoring with sdiag, the max cycle time of the main scheduler never
>> cracks 2 seconds. That seems reasonable?
>>
>> Thanks,
>>
>> Sean
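
P.S. For the archives, these are the quick checks I plan to run the next time slurmctld wedges, to see whether slurmdbd/MySQL is the bottleneck (assumes the stock MySQL client on the DBD host and StorageUser=slurm; adjust credentials for your site):

    # On the controller: server thread count and agent queue size
    sdiag | egrep -i 'thread|agent|queue'

    # On the slurmdbd host: see whether slurmdbd's rollup/queries are keeping MySQL busy
    mysql -u root -p -e 'SHOW FULL PROCESSLIST' | grep -i slurm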