Hi All,

I've encountered what I think is a bug with srun's exit status when a timeout occurs, but perhaps my expectation is off. My expectation is for srun to have a non-zero exit status when a timeout occurs before all tasks can complete.

This behaves as expected when all tasks are timed out:

> srun --time 1 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"
srun: Force Terminated job 2392836
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 2392836.0 ON foo0205 CANCELLED AT 2018-04-19T18:33:34 DUE TO TIME LIMIT ***
srun: error: foo0205: tasks 0-1: Terminated
status: 143

However, when some tasks complete while others are timed out, srun always exits with a zero status. This is not what I expect, since tasks were forcefully terminated:

> srun --time 3 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"
srun: Force Terminated job 2392845
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 2392845.0 ON foo3009 CANCELLED AT 2018-04-19T18:37:04 DUE TO TIME LIMIT ***
srun: error: foo3009: task 1: Terminated
status: 0

Is my expectation off, or does this look like a genuine bug?

Thanks,
- Dan
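
P.S. For anyone reading the first transcript: the 143 there is consistent with the usual shell convention of reporting 128 plus the signal number for a process killed by a signal, with SIGTERM being signal 15. The sketch below is independent of Slurm and only reproduces that convention locally; it assumes a POSIX-style sh and says nothing about how srun itself derives its exit status.

```shell
# Assumption: a POSIX-style sh is available; no Slurm involved.
# Start a child shell that sends SIGTERM to itself, then capture
# the status the parent shell reports for it.
sh -c 'kill -TERM $$'
status=$?

# A process terminated by SIGTERM (signal 15) is reported as 128 + 15.
echo "status: $status"
```

On the systems I would expect this to run on, it prints "status: 143", matching the all-tasks-terminated case above.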