Hi Byron,

byron <lbgpub...@gmail.com> writes:

> Hi Loris - about a second

What is the use-case for that?  Are these individual jobs or is it a
job array?  Either way, it sounds to me like a very bad idea.

On our system, jobs which can start immediately because resources are
available still take a few seconds to start running (I'm looking at the
values for 'Submit' and 'Start' from 'sacct' - see the example at the
end of this mail).  If a one-second job has to wait for just a minute,
the ratio of wait-time to run-time is already disproportionately large.

Why doesn't the user bundle these individual jobs together?  Depending
on your maximum run-time and on the degree to which the jobs can make
use of backfill, I would tell the user to bundle everything into
something between a single job and maybe 100 jobs (a rough sketch of
such bundling is also at the end of this mail).  I certainly wouldn't
allow one-second jobs in any significant numbers on our system.

I think having a job starting every second is causing your slurmdbd to
time out, and that is the error you are seeing.

Regards

Loris

> On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> > Hi Byron,
> >
> > byron <lbgpub...@gmail.com> writes:
> >
> > > Hi
> > >
> > > We recently upgraded Slurm from 19.05.7 to 20.11.9 and now we
> > > occasionally (3 times in 2 months) have slurmctld hanging, so we get
> > > the following message when running sinfo:
> > >
> > >   "slurm_load_jobs error: Socket timed out on send/recv operation"
> > >
> > > It only seems to happen when one of our users runs a job that submits
> > > a short-lived job every second for 5 days (up to 90,000 in a day).
> > > Although that could be a red herring.
> >
> > What's your definition of a 'short-lived job'?
> >
> > > There is nothing to be found in the slurmctld log.
> > >
> > > Can anyone suggest how to even start troubleshooting this?  Without
> > > anything in the logs I don't know where to start.
> > >
> > > Thanks
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Herr/Mr)
> > ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
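
P.S.  For the submit-versus-start gap mentioned above, this is roughly
what I look at (the job ID is of course just a placeholder):

    # Compare submission time, start time, elapsed time and state for a job
    # (1234567 is a made-up job ID; substitute a real one)
    sacct -j 1234567 --format=JobID,Submit,Start,Elapsed,State
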
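And this is the kind of bundling I have in mind: a single batch script
that runs, say, 100 of the short tasks one after the other instead of
submitting each as its own job.  It is only a sketch - './short_task',
the count of 100 and the time limit are placeholders that would need to
be adapted to the user's program and your site's limits:

    #!/bin/bash
    #SBATCH --job-name=bundled-short-tasks
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00      # adjust to your maximum run-time

    # Run 100 short tasks back-to-back inside a single job, so the
    # scheduler and slurmdbd see 1 job instead of 100.
    for i in $(seq 1 100); do
        ./short_task "$i"        # placeholder for the user's actual program
    done

A job array with a similar loop inside each array task would be the
other obvious way to cut the job count, but the idea is the same.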
> Hi Loris - about a second What is the use-case for that? Are these individual jobs or it a job array. Either way it sounds to me like a very bad idea. On our system, jobs which can start immediately because resources are available, still take a few seconds to start running (I'm looking at the values for 'submit' and 'start' from 'sacct'). If a one-second job has to wait for just a minute, the ration of wait-time to run-time is already disproportionately large. Why doesn't the user bundle these individual jobs together? Depending on your maximum run-time and to what degree jobs can make use of backfill, I would tell the user something between a single job and maybe 100 job. I certainly wouldn't allow one-second jobs in any significant numbers on our system. I think having a job starting every second is causing your slurmdbd to timeout and that is the error you are seeing. Regards Loris > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de> > wrote: > > Hi Byron, > > byron <lbgpub...@gmail.com> writes: > > > Hi > > > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally > (3 times in 2 months) have slurmctld hanging so we get the following message > when running sinfo > > > > “slurm_load_jobs error: Socket timed out on send/recv operation” > > > > It only seems to happen when one of our users runs a job that submits a > short lived job every second for 5 days (up to 90,000 in a day). Although > that could be a red-herring. > > What's your definition of a 'short lived job'? > > > There is nothing to be found in the slurmctld log. > > > > Can anyone suggest how to even start troubleshooting this? Without > anything in the logs I dont know where to start. > > > > Thanks > > Cheers, > > Loris > > -- > Dr. Loris Bennett (Herr/Mr) > ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de