Hi,
There are currently three options to "--halt": ignore (0), stop new jobs (1),
or kill everything (2).
I propose an additional option: setting the number of job failures to tolerate
before taking action. This would allow some tolerance of failure while still
catching global problems.
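
To make the proposed semantics concrete, here is a minimal Python sketch of a "failure budget". It is not how parallel is implemented; the function name and the max_failures parameter are made up for illustration, and jobs run one at a time to keep the logic visible.

```python
import subprocess
from typing import List, Sequence, Tuple


def run_with_failure_budget(
    commands: Sequence[str], max_failures: int
) -> Tuple[List[str], int]:
    """Run shell commands in order and stop starting new ones once
    max_failures of them have exited with a non-zero status.

    Hypothetical helper: it only models the proposed behaviour
    (tolerate a few failures, then stop new jobs).
    """
    failures = 0
    started: List[str] = []
    for cmd in commands:
        if failures >= max_failures:
            # Budget used up: behave like "stop new jobs".
            break
        started.append(cmd)
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            failures += 1
    return started, failures
```

With a budget of 2, a single bad input does not stop the run, but a problem that breaks every job stops it after two attempts instead of burning through the whole queue.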
Consider this exa
On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme wrote:
> There are currently three options to "--halt" - ignore (0), stop new jobs (1),
> or kill everything (2).
>
> I propose an additional option; to set the number of job failures before
> doing anything. This would then allow some tolerance of
> You need to give a reproducible example where you cannot just use
> --halt 0 and then later --resume-failed when you have fixed the
> bug/the input data.
My use case is processing large amounts of data on a heavily-subscribed cluster
with *days* of queue time. Re-queueing is expensive.
Cheers