date:20140718

Feature request: halt on threshold

2014-07-18 Thread Ben Rusholme

Hi, There are currently three options to "--halt": ignore (0), stop new jobs (1), or kill everything (2). I propose an additional option: to set the number of job failures before doing anything. This would then allow some tolerance of failure but would catch global problems. Consider this exa

Re: Feature request: halt on threshold

2014-07-18 Thread Ole Tange

On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme wrote:
> There are currently three options to "--halt" - ignore (0), stop new jobs (1),
> or kill everything (2).
>
> I propose an additional option; to set the number of job failures before
> doing anything. This would then allow some tolerance of

Re: Feature request: halt on threshold

2014-07-18 Thread Ben Rusholme

> You need to give a reproducible example where you cannot just use
> --halt 0 and then later --resume-failed when you have fixed the
> bug/the input data.

My use case is processing large amounts of data on a heavily-subscribed cluster with *days* of queue time. Re-queueing is expensive.

Cheers
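
For readers following the thread, the requested behaviour (tolerate a few failures, then stop starting new jobs) can be sketched outside GNU Parallel. This is only an illustration under stated assumptions, not how GNU Parallel implements "--halt": the function name run_with_failure_threshold and its parameters are hypothetical, jobs are plain Python callables, and a job counts as failed when it raises an exception.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from itertools import islice


def run_with_failure_threshold(jobs, max_failures, workers=2):
    """Run callables from `jobs`, at most `workers` at a time.

    Hypothetical helper: once `max_failures` jobs have raised, no new
    jobs are started, but jobs already running are allowed to finish.
    Returns a tuple (succeeded, failed, skipped).
    """
    jobs = iter(jobs)
    succeeded = failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start the first batch of jobs.
        running = {pool.submit(job) for job in islice(jobs, workers)}
        while running:
            finished, running = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                if future.exception() is None:
                    succeeded += 1
                else:
                    failed += 1
            # Below the threshold: refill the free worker slots.
            if failed < max_failures:
                for job in islice(jobs, workers - len(running)):
                    running.add(pool.submit(job))
    # Whatever was never started counts as skipped.
    skipped = sum(1 for _ in jobs)
    return succeeded, failed, skipped


def ok():
    return "done"


def boom():
    raise RuntimeError("job failed")


if __name__ == "__main__":
    # With one worker the jobs run strictly in order.
    print(run_with_failure_threshold([ok, boom, boom, ok, ok],
                                     max_failures=2, workers=1))
```

A threshold of this kind sits between the existing settings described above: isolated failures are tolerated, while a systematic problem stops the run before the remaining jobs are wasted.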