I'm not sure about automatically canceling a job array, except
perhaps by submitting two consecutive arrays: the first of size 20,
and the second with the rest of the elements and an afterok
dependency. As an aside, the Slurm documentation refers to a single
job in a job array as a task; I personally prefer element, as in
array element.
Consider creating a batch job with:
arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
I'm not near a cluster right now, so I can't test for correctness.
The main drawback, of course, is that if the first 20 jobs take a
long time to complete, and there are enough resources to run more
than 20 jobs in parallel, those resources will sit idle for the
duration. That's not a big issue on a busy cluster, since some other
job will run in the meantime, but it will lengthen the array's
completion time if 20 jobs use significantly less than the
resources available.
It might be possible to add a job that depends on afternotok of the first 20 tasks and runs --wrap="scancel $arrayid"
Maybe something like:
sbatch --array=1-50000 array-job.sh
with cat array-job.sh
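The afternotok idea above might be sketched as follows. This is untested against a live cluster; the function name is mine, and it assumes standard sbatch/scancel behavior (afterok leaves the dependent array pending if a canary fails, and the afternotok guard then cancels it):

```shell
# Sketch: first 20 elements act as canaries for the rest of the array.
submit_guarded_array() {
    # Canary elements run unconditionally.
    canary=$(sbatch --parsable --array=0-19 array-job.sh)
    # The remainder only starts once every canary exited 0.
    rest=$(sbatch --parsable --dependency=afterok:"$canary" --array=20-50000 array-job.sh)
    # If any canary fails, cancel the now-stuck remainder.
    sbatch --dependency=afternotok:"$canary" --wrap="scancel $rest"
}
```

Whether the pending remainder is cancelled automatically or lingers as DependencyNeverSatisfied depends on the scheduler's dependency settings, hence the explicit scancel guard.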
Another approach might be to use some epilog (or possibly
EpilogSlurmctld) to log exit codes for the first 20 tasks in each
array, and cancel the array on a non-zero code. This is a global
approach that affects all job arrays, so it might not be
appropriate for your use case.
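A minimal sketch of such an epilog, untested: the SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID and SLURM_JOB_EXIT_CODE variables are the ones the Slurm prolog/epilog documentation lists for EpilogSlurmctld, but treat the exact names and value formats as assumptions to verify against your Slurm version:

```shell
#!/bin/bash
# Hypothetical EpilogSlurmctld: cancel the whole array when one of its
# first 20 elements exits non-zero.
should_cancel() {
    [ -n "$SLURM_ARRAY_JOB_ID" ] || return 1         # not an array element
    [ "${SLURM_ARRAY_TASK_ID:-99}" -lt 20 ] || return 1  # only watch the first 20
    # Exit code may be a plain wait() status or an "exit:signal" pair;
    # strip anything after a colon and test for non-zero.
    [ "${SLURM_JOB_EXIT_CODE%%:*}" -ne 0 ] 2>/dev/null
}
if should_cancel; then
    scancel "$SLURM_ARRAY_JOB_ID"    # kills the remaining elements
fi
```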
On 01/08/2023 16:48:47, Josef Dvoracek wrote:
> my users found the beauty of job arrays, and they tend to use it every now and then.
--
Regards,
--Dani_L.