Re: [slurm-users] stopping job array after N failed jobs in row

2023-08-02 Thread Michael DiDomenico
On Tue, Aug 1, 2023 at 3:27 PM Daniel Letai wrote:
> The other OTHER approach might be to use some epilog (or possibly epilogslurmctld) to log exit codes for first 20 tasks in each array, and cancel the array if non-zero. This is a global approach which will affect all job arrays, so migh

Re: [slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Loris Bennett
Daniel Letai writes:
> Not sure about automatically canceling a job array, except perhaps by submitting 2 consecutive arrays - first of size 20, and the other with the rest of the elements and a dependency of afterok. That said, a single job in a job array in Slurm documentation is ref

Re: [slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Daniel Letai
Not sure about automatically canceling a job array, except perhaps by submitting 2 consecutive arrays - first of size 20, and the other with the rest of the elements and a dependency of afterok. That said, a single job in a job array in Slurm documentation is refe
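
The two-array idea above can be sketched as follows. This is only an illustration, assuming a hypothetical batch script name and zero-based array indices; the helper names `probe_command` and `remainder_command` are invented for this example and are not part of Slurm. It only builds the `sbatch` argument lists (using the standard `--array`, `--parsable`, and `--dependency=afterok:` options), so it can be inspected without a cluster.

```python
from typing import List


def probe_command(script: str, probe_size: int = 20) -> List[str]:
    """Build the sbatch call for the small first array (tasks 0..probe_size-1)."""
    if probe_size < 1:
        raise ValueError("probe_size must be at least 1")
    # --parsable makes sbatch print just the job ID, which the second call needs.
    return ["sbatch", "--parsable", f"--array=0-{probe_size - 1}", script]


def remainder_command(
    script: str, total_tasks: int, probe_job_id: str, probe_size: int = 20
) -> List[str]:
    """Build the sbatch call for the remaining tasks, held until the probe array succeeds."""
    if total_tasks <= probe_size:
        raise ValueError("no tasks left for the second array")
    return [
        "sbatch",
        "--parsable",
        # afterok: start only if the probe array finished successfully.
        f"--dependency=afterok:{probe_job_id}",
        f"--array={probe_size}-{total_tasks - 1}",
        script,
    ]


if __name__ == "__main__":
    # "job.sh" and the job ID "12345" are placeholders for illustration.
    print(probe_command("job.sh"))
    print(remainder_command("job.sh", total_tasks=100, probe_job_id="12345"))
```

In practice the job ID printed by the first submission would be passed as `probe_job_id` to the second. If the first array fails, the dependent array will not start, though it may remain pending until it is cancelled.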

[slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Josef Dvoracek
My users have discovered the beauty of job arrays and use them every now and then. Sometimes the human factor steps in: something is wrong in the job array specification, and the cluster "works" on one failed array job after another. Is there any way to automatically stop/scancel/? job array
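
The "N failed jobs in a row" condition from the question can be expressed as a small check. This is a minimal sketch, not an existing Slurm feature: it assumes the exit codes of finished array tasks have already been collected in completion order by some other means, and the names `should_cancel` and `cancel_command` are hypothetical. The only Slurm call it refers to is `scancel` with the array job ID, which cancels the whole array.

```python
from typing import Iterable, List


def should_cancel(exit_codes: Iterable[int], threshold: int) -> bool:
    """Return True once `threshold` consecutive non-zero exit codes are seen."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    run = 0  # length of the current streak of failures
    for code in exit_codes:
        run = run + 1 if code != 0 else 0  # a success resets the streak
        if run >= threshold:
            return True
    return False


def cancel_command(array_job_id: str) -> List[str]:
    """Build the scancel call that cancels every task of the given array."""
    return ["scancel", array_job_id]


if __name__ == "__main__":
    # Placeholder exit codes and job ID, for illustration only.
    codes = [0, 1, 1, 1]
    if should_cancel(codes, threshold=3):
        print(cancel_command("12345"))
```

Counting a streak rather than total failures means an occasional bad task does not stop an otherwise healthy array.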