I was actually looking at something else (tm) when I noticed that
two of our Slurm-controlled resources had different config values
for KillOnBadExit, so I went looking for clues.


I read this:

KillOnBadExit

    If set to 1, a step will be terminated immediately if any task is
    crashed or aborted, as indicated by a non-zero exit code.

    With the default value of 0, if one of the processes is crashed
    or aborted the other processes will continue to run while the
    crashed or aborted process waits.

    The user can override this configuration parameter by using srun's
    -K, --kill-on-bad-exit.


and thought that if, in my mind, I replaced "process(es)" with "task(s)",
it made sense, but of course, I had to go and RTFsrunM, didn't I, viz:


 -K, --kill-on-bad-exit[=0|1]

    Controls whether or not to terminate a step if any task exits with
    a non-zero exit code.

    If this option is not specified, the default action will be based
    upon the Slurm configuration parameter of KillOnBadExit. If this
    option is specified, it will take precedence over KillOnBadExit.

    An option argument of zero will not terminate the job. A non-zero
    argument or no argument will terminate the job.

    Note: This option takes precedence over the -W, --wait option to
    terminate the job immediately if a task exits with a non-zero exit
    code.

    Since this option's argument is optional, for proper parsing the
    single letter option must be followed immediately with the value
    and not include a space between them. For example "-K1" and not
    "-K 1".


so now we're talking about the "job", as well as a "step" within
a job?

Then again, one could read that as the config setting only bins the
step the task was in, but then the srun flag isn't overriding the
config setting (per-step), it's escalating the bin-on-any-failure
to the job level?

Then again, srun's "bare kill-on-bad-exit" might be thought of as
"overriding the config" but only to the extent that it can turn
a config (per-step) of 0 into a config (per-step) of 1, by being
there, but not the other way around, because there isn't any
--no-kill-on-bad-exit ?

And both of those suggest that the config can't be used to set a
kill of a whole job, only of a step. If you want to do that, the
srun man-page points out you can use -W, but suggests that -K
will override that too.

So now I think I've gone two steps forwards, one job back: but where
am I really?
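
For the record, here is the precedence as I currently read it from the
two excerpts above, as a small Python sketch. This is only my model of
the documented behaviour, not Slurm's code: the function names, and the
use of None for "option not given", are my own assumptions. It also
deliberately says nothing about whether it is the step or the job that
gets killed, since that is the bit I can't work out.

```python
from typing import List, Optional


def parse_kill_flag(args: List[str]) -> Optional[int]:
    """Toy parser for -K / --kill-on-bad-exit[=0|1] (not srun's real parser).

    Returns None if the option is absent, 1 for the bare form,
    otherwise the numeric argument that was attached to the option.
    """
    value = None
    for arg in args:
        if arg in ("-K", "--kill-on-bad-exit"):
            value = 1                      # no argument: treated as non-zero
        elif arg.startswith("--kill-on-bad-exit="):
            value = int(arg.split("=", 1)[1])
        elif arg.startswith("-K") and arg[2:].isdigit():
            value = int(arg[2:])           # "-K1", never "-K 1"
    return value


def effective_kill_on_bad_exit(config_value: int,
                               srun_value: Optional[int]) -> bool:
    """If the srun option was given it wins; otherwise use KillOnBadExit."""
    if srun_value is None:
        return config_value != 0
    return srun_value != 0
```

Under that reading, a bare -K can only ever turn a config of 0 into
a 1, but the "[=0|1]" in the synopsis suggests that -K0 (or
--kill-on-bad-exit=0) is the other direction, even without a
--no-kill-on-bad-exit.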


Is there a possible future, with a

TaskFailureAction = Ignore|KillStep|KillJob(|KillJobArray?)

config value, along with an associated

--task-failure-action=[0|1|2(|3)]

command-line option, in it, as that would seem to offer a clearer
"this overrides that" mapping?
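
To make the suggestion concrete, a sketch in Python of what I have in
mind. Everything here is hypothetical: neither TaskFailureAction nor
--task-failure-action exists in Slurm as far as I know, and the mapping
from today's KillOnBadExit onto "KillStep" is just my assumption from
the slurm.conf wording.

```python
from enum import IntEnum
from typing import Optional


class TaskFailureAction(IntEnum):
    """Hypothetical setting: what to do when a task exits non-zero."""
    IGNORE = 0
    KILL_STEP = 1
    KILL_JOB = 2
    KILL_JOB_ARRAY = 3


def from_kill_on_bad_exit(value: int) -> TaskFailureAction:
    # Assumption: KillOnBadExit=1 means "terminate the step", as the
    # slurm.conf text says; 0 means carry on regardless.
    return TaskFailureAction.KILL_STEP if value else TaskFailureAction.IGNORE


def resolve(config_action: TaskFailureAction,
            cli_action: Optional[TaskFailureAction] = None) -> TaskFailureAction:
    # The command line, if given, simply replaces the config value,
    # in either direction and at any level.
    return config_action if cli_action is None else cli_action
```

With that, "this overrides that" is one rule: whatever level the
config names, the command-line option can raise or lower it.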

Then again, as this wasn't what I was originally looking for/at,
maybe I've missed something.

Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
