Daniel Letai <d...@letai.org.il> writes:

> Not sure about automatically canceling a job array, except perhaps by
> submitting 2 consecutive arrays - the first of size 20, and the other
> with the rest of the elements and a dependency of afterok.  That said,
> in the Slurm documentation a single job in a job array is referred to
> as a task.  I personally prefer element, as in array element.
>
> Consider creating a batch job with:
>
> # --parsable makes sbatch print just the job ID of the pilot array.
> arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
>
> # The second array starts only if every task of the first one finished
> # successfully (afterok on an array's job ID covers all of its tasks).
> sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
>
> I'm not near a cluster right now, so I can't test for correctness.  The
> main drawback is of course that if the first 20 tasks take a long time
> to complete, and there are enough resources to run more than 20 jobs in
> parallel, all those resources will be wasted for the duration.  That's
> not a big issue on a busy cluster, as some other job will run in the
> meantime, but it will delay completion of the array if the first 20
> tasks need significantly fewer resources than are available.
I think running an initial subarray is a good idea, since, once it has
completed, it allows the user to check whether the right amount of
resources was requested.  I often find users don't do this and end up,
say, specifying 10 or 100 times more memory than actually needed for an
array of several thousand jobs.  This is obviously a problem even if the
jobs all complete successfully.

Cheers,

Loris

> It might be possible to depend on afternotok of the first 20 tasks, to
> run a job with --wrap="scancel $arrayid"
>
> Maybe something like:
>
> sbatch --array=1-50000 array-job.sh
>
> with
>
> cat array-job.sh
>
> #!/bin/bash
>
> # Run the real work in the background.
> srun myjob.sh $SLURM_ARRAY_TASK_ID &
>
> # Every task after the first 20 also launches a step that is meant to
> # wait on afternotok of tasks 1-20 (the ... stands for tasks 3 to 19).
> # Comma-separated dependencies must all be satisfied, so scancel
> # should only ever run if all 20 pilot tasks failed.
> [[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun \
>   -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...,afternotok:${SLURM_ARRAY_JOB_ID}_20 \
>   scancel $SLURM_ARRAY_JOB_ID
>
> might also work.  Untested, use at your own risk.
>
> Another, quite different approach might be to use some epilog (or
> possibly EpilogSlurmctld) to log the exit codes of the first 20 tasks
> in each array, and cancel the array if any is non-zero.  This is a
> global approach which will affect all job arrays, so it might not be
> appropriate for your use case.
>
> On 01/08/2023 16:48:47, Josef Dvoracek wrote:
>
>> My users have found the beauty of job arrays, and they tend to use
>> them every now and then.
>>
>> Sometimes the human factor steps in, something is wrong in the job
>> array specification, and the cluster "works" on one failed array job
>> after another.
>>
>> Isn't there any way to automatically stop/scancel/... a job array
>> after, let's say, 20 failed array jobs in a row?
>>
>> So far my experience is that if the first ~20 array jobs go right,
>> there is no catastrophic failure in the sbatch file.  If they fail,
>> it's usually bad and there is no sense in crunching the remaining
>> thousands of array jobs.
>>
>> OT: what is the correct terminology for one item in a job array...
>> sub-job?  job-array-job? :)
>>
>> cheers
>>
>> josef
--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
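
P.S.  For the resource check after the pilot tasks, something along
these lines (untested from here, and the available fields depend on
your accounting setup) shows what each task actually used:

  # Requested vs. actually used memory, per array task, in MB.
  sacct -j $arrayid --units=M \
        --format=JobID,State,ExitCode,Elapsed,ReqMem,MaxRSS

Comparing MaxRSS with ReqMem before submitting the remaining elements
is usually enough to catch a request that is 10 or 100 times too large.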
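
P.P.S.  Regarding the epilog idea: a very rough sketch of what an
EpilogSlurmctld script could look like, assuming SLURM_ARRAY_JOB_ID,
SLURM_ARRAY_TASK_ID and SLURM_JOB_EXIT_CODE are available in its
environment, as described in the Prolog/Epilog section of
slurm.conf(5).  Untested, so check the man page for your version:

  #!/bin/bash
  # Cancel a job array once 20 of its tasks have exited non-zero
  # (failures in total, not in a row, which is close enough here).

  # Ignore anything that is not an array task.
  [[ -n "$SLURM_ARRAY_JOB_ID" ]] || exit 0

  # Ignore tasks that succeeded.
  [[ "$SLURM_JOB_EXIT_CODE" -ne 0 ]] || exit 0

  failfile="/var/spool/slurmctld/failed.${SLURM_ARRAY_JOB_ID}"
  echo "$SLURM_ARRAY_TASK_ID" >> "$failfile"

  # No locking here - fine for a sketch, racy in real life.
  if [[ $(wc -l < "$failfile") -ge 20 ]]; then
      scancel "$SLURM_ARRAY_JOB_ID"
  fi
  exit 0

The path /var/spool/slurmctld is just a placeholder, and the fail-count
files would need cleaning up at some point.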