Hi Byron,

Does the slurmctld recover by itself, or does it require a manual restart of the service? We had some deadlock issues related to MCS handling just after doing the 19->20->21 upgrades. I don't recall what fixed the issue, but disabling MCS might be a good place to start if you are using it.
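If it is enabled on your side, disabling it should come down to a slurm.conf change along these lines (just a sketch; as far as I remember a plugin change needs an actual slurmctld restart, not only 'scontrol reconfigure'):

    # slurm.conf: switch (back) to the no-op MCS plugin, which I believe is the default
    MCSPlugin=mcs/none
    # and remove or comment out any MCSParameters line as well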
best regards
Maciej Pawlik

Fri, 29 Jul 2022 at 11:34, byron <lbgpub...@gmail.com> wrote:

> Yep, the question of how he has the job set up is an ongoing conversation,
> but for now it is staying like this and I have to make do.
>
> Even with all the traffic he is generating though (at worst 1 a second
> over the course of a day) I would still have thought that slurm was capable
> of managing that. And it was, until I did the upgrade.
>
>
> On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
>> Hi Byron,
>>
>> byron <lbgpub...@gmail.com> writes:
>>
>> > Hi Loris - about a second
>>
>> What is the use-case for that?  Are these individual jobs or is it a job
>> array?  Either way it sounds to me like a very bad idea.  On our system,
>> jobs which can start immediately because resources are available still
>> take a few seconds to start running (I'm looking at the values for
>> 'submit' and 'start' from 'sacct').  If a one-second job has to wait for
>> just a minute, the ratio of wait-time to run-time is already
>> disproportionately large.
>>
>> Why doesn't the user bundle these individual jobs together?  Depending
>> on your maximum run-time and to what degree jobs can make use of
>> backfill, I would tell the user something between a single job and
>> maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
>> significant numbers on our system.
>>
>> I think having a job starting every second is causing your slurmdbd to
>> time out and that is the error you are seeing.
>>
>> Regards
>>
>> Loris
>>
>> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett
>> > <loris.benn...@fu-berlin.de> wrote:
>> >
>> > Hi Byron,
>> >
>> > byron <lbgpub...@gmail.com> writes:
>> >
>> > > Hi
>> > >
>> > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
>> > > occasionally (3 times in 2 months) have slurmctld hanging, so we get
>> > > the following message when running sinfo:
>> > >
>> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
>> > >
>> > > It only seems to happen when one of our users runs a job that submits
>> > > a short lived job every second for 5 days (up to 90,000 in a day).
>> > > Although that could be a red herring.
>> >
>> > What's your definition of a 'short lived job'?
>> >
>> > > There is nothing to be found in the slurmctld log.
>> > >
>> > > Can anyone suggest how to even start troubleshooting this?  Without
>> > > anything in the logs I don't know where to start.
>> > >
>> > > Thanks
>> >
>> > Cheers,
>> >
>> > Loris
>> >
>> > --
>> > Dr. Loris Bennett (Herr/Mr)
>> > ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
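
PS: In case it helps with the conversation with your user, the bundling Loris describes above could look something like the sketch below. The file name 'tasks.txt', the chunk size and the array range are invented for the example; the idea is just that each array element works through a slice of the short tasks rather than each task being submitted as its own job.

    #!/bin/bash
    #SBATCH --job-name=bundled_short_tasks
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00        # long enough for one whole chunk of tasks
    #SBATCH --array=0-89           # 90 chunks x 1000 tasks covers ~90,000 tasks

    # tasks.txt (hypothetical) holds one short command per line.
    # Each array element runs its own 1000-line slice of the file.
    CHUNK=1000
    START=$(( SLURM_ARRAY_TASK_ID * CHUNK + 1 ))
    END=$(( START + CHUNK - 1 ))
    sed -n "${START},${END}p" tasks.txt | while read -r cmd; do
        bash -c "$cmd"
    done

Ninety array elements a day instead of 90,000 individual job records should be far easier on slurmctld and slurmdbd.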