Just a couple of general comments from experience:
1) If you can, use either xargs or parallel to do the forking so you
can limit the number of simultaneous submissions (rough sketch after
this list).
2) I have yet to see a case where many separate jobs are a better idea
than a single job array when an array can work.
If you can prepare a proper input file for your script, a single
submission is all it takes. You can then control how many array tasks
run at once (the array task throttle) and adjust that to scale up/down
(see the array sketch below).
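As a rough, untested sketch of the xargs approach (the file name
job_scripts.txt and the -P value are just placeholders for whatever
you have), something like this caps you at 4 submissions in flight at
a time:

    # one sbatch invocation per line of job_scripts.txt,
    # at most 4 running concurrently
    xargs -P 4 -n 1 sbatch < job_scripts.txt

    # roughly the same thing with GNU parallel
    parallel -j 4 sbatch {} :::: job_scripts.txt

This assumes each line is a single script path with no spaces in it.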
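And a minimal sketch of the array approach (the file names, program
name, array size and %20 throttle value are all placeholders):

    #!/bin/bash
    #SBATCH --array=1-1000%20       # 1000 tasks, at most 20 running at once
    #SBATCH --output=run_%A_%a.out

    # each array task reads its own line from the input file you prepared
    ARGS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
    ./my_program $ARGS

That is one sbatch submission no matter how many tasks there are, and
if I remember correctly, recent Slurm versions let you raise or lower
the throttle on a running array with something like
scontrol update jobid=<jobid> ArrayTaskThrottle=50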
Brian Andrus
On 8/25/2019 11:12 PM, Guillaume Perrault Archambault wrote:
Hello,
I wrote a regression-testing toolkit to manage large numbers of SLURM
jobs and their output (the toolkit is available at
https://github.com/gobbedy/slurm_simulation_toolkit/ if anyone is
interested).
To make job launching faster, sbatch commands are forked so that
numerous jobs can be submitted in parallel.
We (the cluster admin and myself) are concerned that this may cause
unresponsiveness for other users.
I cannot say for sure since I don't have visibility into all users of
the cluster, but unresponsiveness doesn't seem to have occurred so far.
That said, the fact that it hasn't happened yet doesn't mean it won't
in the future, so I'm treating this as a ticking time bomb to be fixed
ASAP.
My questions are the following:
1) Does anyone have experience with large numbers of jobs submitted in
parallel? What limits can be hit? For example, is there some hard limit
on how many jobs a SLURM scheduler can handle before slowing down or
becoming unresponsive?
2) Is there a way for me to find/measure/ping this resource limit?
3) How can I make sure I don't hit this resource limit?
From what I've observed, parallel submission can improve submission
time by a factor of at least 10, which can make a big difference in
users' workflows.
For that reason, I would like to keep sequential job launching only as
a last resort.
Thanks in advance.
Regards,
Guillaume.