Hi Byron,

byron <lbgpub...@gmail.com> writes:

> Hi Loris - about a second

What is the use-case for that?  Are these individual jobs or is it a
job array?  Either way, it sounds to me like a very bad idea.

On our system, jobs which can start immediately because resources are
available still take a few seconds to start running (I'm looking at the
values for 'Submit' and 'Start' from 'sacct' - see the example at the
end of this mail).  If a one-second job has to wait for just a minute,
the ratio of wait-time to run-time is already disproportionately large.

Why doesn't the user bundle these individual jobs together?  Depending
on your maximum run-time and on the degree to which the jobs can make
use of backfill, I would tell the user to bundle everything into
something between a single job and maybe 100 jobs (a rough sketch of
such bundling is also at the end of this mail).  I certainly wouldn't
allow one-second jobs in any significant numbers on our system.

I think having a job starting every second is causing your slurmdbd to
time out, and that is the error you are seeing.

Regards

Loris

> On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> > Hi Byron,
> >
> > byron <lbgpub...@gmail.com> writes:
> >
> > > Hi
> > >
> > > We recently upgraded Slurm from 19.05.7 to 20.11.9 and now we
> > > occasionally (3 times in 2 months) have slurmctld hanging, so we get
> > > the following message when running sinfo:
> > >
> > >   "slurm_load_jobs error: Socket timed out on send/recv operation"
> > >
> > > It only seems to happen when one of our users runs a job that submits
> > > a short-lived job every second for 5 days (up to 90,000 in a day).
> > > Although that could be a red herring.
> >
> > What's your definition of a 'short-lived job'?
> >
> > > There is nothing to be found in the slurmctld log.
> > >
> > > Can anyone suggest how to even start troubleshooting this?  Without
> > > anything in the logs I don't know where to start.
> > >
> > > Thanks
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Herr/Mr)
> > ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
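
P.S.  For the submit-versus-start gap mentioned above, this is roughly
what I look at (the job ID is of course just a placeholder):

    # Compare submission time, start time, elapsed time and state for a job
    # (1234567 is a made-up job ID; substitute a real one)
    sacct -j 1234567 --format=JobID,Submit,Start,Elapsed,State
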
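And this is the kind of bundling I have in mind: a single batch script
that runs, say, 100 of the short tasks one after the other instead of
submitting each as its own job.  It is only a sketch - './short_task',
the count of 100 and the time limit are placeholders that would need to
be adapted to the user's program and your site's limits:

    #!/bin/bash
    #SBATCH --job-name=bundled-short-tasks
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00      # adjust to your maximum run-time

    # Run 100 short tasks back-to-back inside a single job, so the
    # scheduler and slurmdbd see 1 job instead of 100.
    for i in $(seq 1 100); do
        ./short_task "$i"        # placeholder for the user's actual program
    done

A job array with a similar loop inside each array task would be the
other obvious way to cut the job count, but the idea is the same.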
> Hi Loris - about a second What is the use-case for that? Are these individual jobs or it a job array. Either way it sounds to me like a very bad idea. On our system, jobs which can start immediately because resources are available, still take a few seconds to start running (I'm looking at the values for 'submit' and 'start' from 'sacct'). If a one-second job has to wait for just a minute, the ration of wait-time to run-time is already disproportionately large. Why doesn't the user bundle these individual jobs together? Depending on your maximum run-time and to what degree jobs can make use of backfill, I would tell the user something between a single job and maybe 100 job. I certainly wouldn't allow one-second jobs in any significant numbers on our system. I think having a job starting every second is causing your slurmdbd to timeout and that is the error you are seeing. Regards Loris > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de> > wrote: > > Hi Byron, > > byron <lbgpub...@gmail.com> writes: > > > Hi > > > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally > (3 times in 2 months) have slurmctld hanging so we get the following message > when running sinfo > > > > “slurm_load_jobs error: Socket timed out on send/recv operation” > > > > It only seems to happen when one of our users runs a job that submits a > short lived job every second for 5 days (up to 90,000 in a day). Although > that could be a red-herring. > > What's your definition of a 'short lived job'? > > > There is nothing to be found in the slurmctld log. > > > > Can anyone suggest how to even start troubleshooting this? Without > anything in the logs I dont know where to start. > > > > Thanks > > Cheers, > > Loris > > -- > Dr. Loris Bennett (Herr/Mr) > ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de