I was actually looking at something else (tm) when I noticed that two
of our Slurm-controlled resources had different config values for
KillOnBadExit, and so I went looking for clues.

I read this:

  KillOnBadExit
    If set to 1, a step will be terminated immediately if any task is
    crashed or aborted, as indicated by a non-zero exit code. With the
    default value of 0, if one of the processes is crashed or aborted
    the other processes will continue to run while the crashed or
    aborted process waits. The user can override this configuration
    parameter by using srun's -K, --kill-on-bad-exit.

and thought that if, in my mind, I replaced "process(es)" with
"task(s)", it made sense, but of course, I had to go and RTFsrunM,
didn't I, viz:

  -K, --kill-on-bad-exit[=0|1]
    Controls whether or not to terminate a step if any task exits with
    a non-zero exit code. If this option is not specified, the default
    action will be based upon the Slurm configuration parameter of
    KillOnBadExit. If this option is specified, it will take precedence
    over KillOnBadExit. An option argument of zero will not terminate
    the job. A non-zero argument or no argument will terminate the job.
    Note: This option takes precedence over the -W, --wait option to
    terminate the job immediately if a task exits with a non-zero exit
    code. Since this option's argument is optional, for proper parsing
    the single letter option must be followed immediately with the
    value and not include a space between them. For example "-K1" and
    not "-K 1".

So now we're talking about the "job", as well as a "step" within a job?

Then again, one could read that as the config setting only bins the
step the task was in, but then the srun flag isn't overriding the
config setting (per-step), it's escalating the bin-on-any-failure to
the job level?
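For what it's worth, here is how I read the on/off precedence from
those two passages, as a small Python sketch. The function name and the
way the srun option is encoded (None, "", "0", "1") are my own
invention for illustration; nothing here is Slurm code, and it
deliberately says nothing about step-versus-job scope, which is the
part the text leaves unclear.

```python
from typing import Optional


def effective_kill_on_bad_exit(config_value: int,
                               srun_option: Optional[str] = None) -> bool:
    """Return True if a non-zero task exit should trigger termination.

    Hypothetical helper modelling only the quoted man-page wording.

    config_value: the KillOnBadExit value from the Slurm configuration.
    srun_option:  None -> -K / --kill-on-bad-exit not given at all
                  ""   -> given with no argument (bare -K)
                  "0", "1", ... -> given with an explicit argument
    """
    if srun_option is None:
        # Option not specified: fall back to the configuration parameter.
        # Assumption: any non-zero config value counts as "on"; the
        # quoted text only documents 0 and 1.
        return config_value != 0
    if srun_option == "":
        # "A non-zero argument or no argument will terminate the job."
        return True
    # "An option argument of zero will not terminate the job."
    return int(srun_option) != 0


if __name__ == "__main__":
    for cfg in (0, 1):
        for opt in (None, "", "0", "1"):
            print(cfg, repr(opt), effective_kill_on_bad_exit(cfg, opt))
```

On that reading, only the explicit zero form (-K0 or
--kill-on-bad-exit=0) turns a config of 1 back off; the bare flag can
only switch things on.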

Then again, srun's "bare kill-on-bad-exit" might be thought of as
"overriding the config", but only to the extent that it can turn a
config (per-step) of 0 into a config (per-step) of 1, by being there,
but not the other way around, because there isn't any
--no-kill-on-bad-exit?

And both of those suggest that the config can't be used to set a kill
of a whole job, only a step. If you do want to kill the whole job, the
srun man page points out you can use -W, but suggests that -K will
override that too.

So now I think I've gone two steps forwards, one job back, but where am
I really?

Is there a possible future with a

  TaskFailureAction = Ignore|KillStep|KillJob(|KillJobArray?)

config value in it, along with an associated

  --task-failure-action=[0|1|2(|3)]

command-line option, as that would seem to offer a clearer "this
overrides that" mapping?

Then again, as this wasn't what I was originally looking for/at, maybe
I've missed something.

Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre