byron <lbgpub...@gmail.com> writes:

> Yep, the question of how he has the job set up is an ongoing conversation,
> but for now it is staying like this and I have to make do.
Wow, your user must have friends in high places, if he gets to do
something as goofy as starting a one-second job every second.

> Even with all the traffic he is generating though (at worst one a second
> over the course of a day) I would still have thought that Slurm was
> capable of managing that.  And it was, until I did the upgrade.

Maybe you were just lucky.

Aren't blocks of jobs going to start simultaneously if, say, a large
MPI job ends and multiple nodes become available at the same time?  And
if there is a delay of more than a second before those jobs start,
isn't the number of pending jobs just going to increase until the user
hits MaxSubmitJobs?  What happens then?  Or do the friends in high
places ensure that the priorities of this user's jobs are always higher
than everyone else's?

Cheers,

Loris

> On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
> Hi Byron,
>
> byron <lbgpub...@gmail.com> writes:
>
> > Hi Loris - about a second
>
> What is the use-case for that?  Are these individual jobs or is it a job
> array?  Either way it sounds to me like a very bad idea.  On our system,
> jobs which can start immediately because resources are available still
> take a few seconds to start running (I'm looking at the values for
> 'Submit' and 'Start' from 'sacct').  If a one-second job has to wait for
> just a minute, the ratio of wait-time to run-time is already
> disproportionately large.
>
> Why doesn't the user bundle these individual jobs together?  Depending
> on your maximum run-time and the degree to which jobs can make use of
> backfill, I would tell the user something between a single job and
> maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
> significant numbers on our system.
>
> I think having a job starting every second is causing your slurmdbd to
> time out, and that is the error you are seeing.
>
> Regards
>
> Loris
>
> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de>
> > wrote:
> >
> > Hi Byron,
> >
> > byron <lbgpub...@gmail.com> writes:
> >
> > > Hi
> > >
> > > We recently upgraded Slurm from 19.05.7 to 20.11.9 and now we
> > > occasionally (3 times in 2 months) have slurmctld hanging, so we get
> > > the following message when running sinfo:
> > >
> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
> > >
> > > It only seems to happen when one of our users runs a job that submits
> > > a short-lived job every second for 5 days (up to 90,000 in a day).
> > > Although that could be a red herring.
> >
> > What's your definition of a 'short-lived job'?
> >
> > > There is nothing to be found in the slurmctld log.
> > >
> > > Can anyone suggest how to even start troubleshooting this?  Without
> > > anything in the logs I don't know where to start.
> > >
> > > Thanks
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Herr/Mr)
> > ZEDAT, Freie Universität Berlin           Email loris.benn...@fu-berlin.de

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin           Email loris.benn...@fu-berlin.de
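
To make the bundling suggested in the thread concrete, a minimal sketch of
a batch script that runs many short tasks inside a single allocation
instead of submitting them as separate jobs.  The task command, task count,
and resource requests below are placeholders, not anything taken from the
thread:

    #!/bin/bash
    #SBATCH --job-name=bundled-short-tasks
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00

    # Run e.g. 1000 one-second tasks back to back in one job, so the
    # scheduler and slurmdbd see one job instead of one thousand.
    for i in $(seq 1 1000); do
        ./my_short_task "$i"   # placeholder for the user's actual command
    done

Choosing the bundle size so that the job's total run-time stays modest keeps
the ratio of scheduling overhead to useful work small and still lets the job
fit into backfill windows.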
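The 'Submit'/'Start' comparison and the MaxSubmitJobs limit mentioned above
can be inspected with the standard accounting tools; the user name and start
date here are placeholders:

    # Wait time per job: compare submit and start times
    sacct -u someuser -S 2022-07-28 -o JobID,Submit,Start,Elapsed,State

    # Current submit limit for the user's association
    sacctmgr show assoc where user=someuser format=User,Account,MaxSubmitJobs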
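As a starting point for the original question about troubleshooting the
"Socket timed out" error, a few read-only commands that show how busy
slurmctld is; whether they actually implicate the one-second job stream is
something to confirm on the system itself:

    # Scheduler and RPC statistics from slurmctld (server thread count,
    # agent queue size, RPC counts); a growing backlog suggests overload.
    sdiag

    # The timeout governing client/controller RPCs such as sinfo's
    scontrol show config | grep -i MessageTimeout

    # Temporarily raise slurmctld's log level to catch the next hang
    scontrol setdebug debug2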