Daniel Letai <d...@letai.org.il> writes:

> Not sure about automatically canceling a job array, except perhaps by
> submitting 2 consecutive arrays - the first of size 20, and the other
> with the rest of the elements and a dependency of afterok.  That said,
> in the Slurm documentation a single job in a job array is referred to
> as a task.  I personally prefer element, as in array element.
>
> Consider creating a batch job with:
>
> # --parsable makes sbatch print just the job ID of the pilot array.
> arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
>
> # The second array starts only if every task of the first one finished
> # successfully (afterok on an array's job ID covers all of its tasks).
> sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
>
> I'm not near a cluster right now, so I can't test for correctness.  The
> main drawback is of course that if the first 20 tasks take a long time
> to complete, and there are enough resources to run more than 20 jobs in
> parallel, all those resources will be wasted for the duration.  That's
> not a big issue on a busy cluster, as some other job will run in the
> meantime, but it will delay completion of the array if the first 20
> tasks need significantly fewer resources than are available.
I think running an initial subarray is a good idea, since, once it has
completed, it allows the user to check whether the right amount of
resources was requested.  I often find users don't do this and end up,
say, specifying 10 or 100 times more memory than actually needed for an
array of several thousand jobs.  This is obviously a problem even if the
jobs all complete successfully.

Cheers,

Loris

> It might be possible to depend on afternotok of the first 20 tasks, to
> run a job with --wrap="scancel $arrayid"
>
> Maybe something like:
>
> sbatch --array=1-50000 array-job.sh
>
> with
>
> cat array-job.sh
>
> #!/bin/bash
>
> # Run the real work in the background.
> srun myjob.sh $SLURM_ARRAY_TASK_ID &
>
> # Every task after the first 20 also launches a step that is meant to
> # wait on afternotok of tasks 1-20 (the ... stands for tasks 3 to 19).
> # Comma-separated dependencies must all be satisfied, so scancel
> # should only ever run if all 20 pilot tasks failed.
> [[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun \
>   -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...,afternotok:${SLURM_ARRAY_JOB_ID}_20 \
>   scancel $SLURM_ARRAY_JOB_ID
>
> might also work.  Untested, use at your own risk.
>
> Another, quite different approach might be to use some epilog (or
> possibly EpilogSlurmctld) to log the exit codes of the first 20 tasks
> in each array, and cancel the array if any is non-zero.  This is a
> global approach which will affect all job arrays, so it might not be
> appropriate for your use case.
>
> On 01/08/2023 16:48:47, Josef Dvoracek wrote:
>
>> My users have found the beauty of job arrays, and they tend to use
>> them every now and then.
>>
>> Sometimes the human factor steps in, something is wrong in the job
>> array specification, and the cluster "works" on one failed array job
>> after another.
>>
>> Isn't there any way to automatically stop/scancel/... a job array
>> after, let's say, 20 failed array jobs in a row?
>>
>> So far my experience is that if the first ~20 array jobs go right,
>> there is no catastrophic failure in the sbatch file.  If they fail,
>> it's usually bad and there is no sense in crunching the remaining
>> thousands of array jobs.
>>
>> OT: what is the correct terminology for one item in a job array...
>> sub-job?  job-array-job? :)
>>
>> cheers
>>
>> josef
--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
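
P.S.  For the resource check after the pilot tasks, something along
these lines (untested from here, and the available fields depend on
your accounting setup) shows what each task actually used:

  # Requested vs. actually used memory, per array task, in MB.
  sacct -j $arrayid --units=M \
        --format=JobID,State,ExitCode,Elapsed,ReqMem,MaxRSS

Comparing MaxRSS with ReqMem before submitting the remaining elements
is usually enough to catch a request that is 10 or 100 times too large.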
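
P.P.S.  Regarding the epilog idea: a very rough sketch of what an
EpilogSlurmctld script could look like, assuming SLURM_ARRAY_JOB_ID,
SLURM_ARRAY_TASK_ID and SLURM_JOB_EXIT_CODE are available in its
environment, as described in the Prolog/Epilog section of
slurm.conf(5).  Untested, so check the man page for your version:

  #!/bin/bash
  # Cancel a job array once 20 of its tasks have exited non-zero
  # (failures in total, not in a row, which is close enough here).

  # Ignore anything that is not an array task.
  [[ -n "$SLURM_ARRAY_JOB_ID" ]] || exit 0

  # Ignore tasks that succeeded.
  [[ "$SLURM_JOB_EXIT_CODE" -ne 0 ]] || exit 0

  failfile="/var/spool/slurmctld/failed.${SLURM_ARRAY_JOB_ID}"
  echo "$SLURM_ARRAY_TASK_ID" >> "$failfile"

  # No locking here - fine for a sketch, racy in real life.
  if [[ $(wc -l < "$failfile") -ge 20 ]]; then
      scancel "$SLURM_ARRAY_JOB_ID"
  fi
  exit 0

The path /var/spool/slurmctld is just a placeholder, and the fail-count
files would need cleaning up at some point.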