It seems that if they are really that short, it would be better to have a single job run through them all, or something like 10 jobs running through 2000 each.

Such short jobs spend more time on setup/teardown than on the work itself, which makes this approach inefficient: the resources used just to schedule them that way far outweigh the resources the jobs actually need.
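
As a rough illustration, a job array along these lines could run 2000 of the short tasks inside each of 10 jobs instead of 20000 separate submissions. This is only a sketch: the task list file (tasks.txt, one command per line) and the time limit are assumptions, not something from your setup.

#!/bin/bash
#SBATCH --job-name=short-tasks
#SBATCH --array=0-9              # 10 array tasks instead of 20000 jobs
#SBATCH --time=01:00:00          # placeholder time limit

# Assumed input: tasks.txt with 20000 lines, one short command per line.
TASKS_PER_JOB=2000
START=$(( SLURM_ARRAY_TASK_ID * TASKS_PER_JOB + 1 ))
END=$(( START + TASKS_PER_JOB - 1 ))

# Run this array task's slice of 2000 commands back to back.
sed -n "${START},${END}p" tasks.txt | while read -r cmd; do
    bash -c "$cmd"
done

Even a plain loop over all 20000 commands in one batch script would already cut the submission and scheduling overhead down to a single job.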

Brian Andrus

On 8/28/2020 3:30 AM, navin srivastava wrote:
Hi Team,

We are facing an issue: several users are submitting 20000 jobs in a single batch submission, each of them a very short job (say 1-2 seconds). While that many jobs are being submitted, slurmctld becomes unresponsive and starts giving this message:

Sending job 6e508a88155d9bec40d752c8331d7ae8 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e51ed0e322c87802b0f3a2f23a7967f to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e638939f90cd59e60c23b8450af9839 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e6acf36bc7e1394a92155a95feb1c92 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6e6c646a29f0ad4e9df35001c367a9f5 to queue.
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)
Sending job 6ebcecb4c27d88f0f48d402e2b079c52 to queue.

During that time the CPU usage of the slurmctld process also climbed above 100%. I found that the nodes are not able to acknowledge back to the server immediately while they move from the comp (completing) state to idle. My thought is that delaying the scheduling cycle would help here. Any idea how it can be done?
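
Something like the following slurm.conf fragment is the kind of change I have in mind (just a sketch; the option values are placeholders and not tested settings, see the SchedulerParameters entry in the slurm.conf man page):

# slurm.conf sketch: defer per-job scheduling at submit time and
# rate-limit the main scheduling loop (placeholder values).
SchedulerParameters=defer,batch_sched_delay=20,sched_min_interval=2000000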

Is there any other solution available for such issues?

Regards
Navin.


