Hey, if you don't require jobs to start immediately, you can use the 'defer' scheduler parameter (set via SchedulerParameters), see: https://slurm.schedmd.com/sched_config.html
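
For reference, here is a minimal sketch of what that could look like in slurm.conf. It assumes you have no other SchedulerParameters set. If you do, append defer to the existing comma-separated list rather than adding a second line:

```
# slurm.conf (sketch; merge with any SchedulerParameters you already have)
# defer: do not attempt to schedule each job individually at submit time;
# jobs are picked up by a later scheduling pass instead.
SchedulerParameters=defer
```

After editing the file, `scontrol reconfigure` should make slurmctld re-read it. The trade-off is that individual jobs start a bit later, because they wait for a later scheduling pass instead of being considered at submit time.
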
best regards
Maciej Pawlik

On Fri, 28 Aug 2020 at 12:32, navin srivastava <navin.alt...@gmail.com> wrote:

> Hi Team,
>
> facing one issue. several users submitting 20000 job in a single batch job
> which is very short jobs( says 1-2 sec). so while submitting more job
> slurmctld become unresponsive and started giving message
>
> ending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
> sbatch: error: Batch job submission failed: Unable to contact slurm
> controller (connect failure)
> Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.
>
> even that time the load of cpu started consuming more than 100% of
> slurmctld process.
> I found that the node is not able to acknowledge immediately to server. it
> is moving from comp to idle.
> so in my thought delay a scheduling cycle will help here. any idea how it
> can be done.
>
> so is there any other solution available for such issues.
>
> Regards
> Navin.