Hi Byron,

byron <lbgpub...@gmail.com> writes:

> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3
> times in 2 months) have slurmctld hanging so we get the following message
> when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a short
> lived job every second for 5 days (up to 90,000 in a day). Although that
> could be a red-herring.

What's your definition of a 'short lived job'?

> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this? Without anything
> in the logs I dont know where to start.
>
> Thanks

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de