Hmmm, I can't quite replicate that:
dmj@cori11:~> salloc -C knl -q interactive -N 2 --no-shell
salloc: Granted job allocation 18219715
salloc: Waiting for resource configuration
salloc: Nodes nid0[2318-2319] are ready for job
dmj@cori11:~> srun --jobid=18219715 /bin/false
srun: error: nid02318: task 0: Exited with exit code 1
srun: Terminating job step 18219715.0
srun: error: nid02319: task 1: Exited with exit code 1
dmj@cori11:~> echo $?
1
dmj@cori11:~> squeue -u dmj
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18219715 interacti    (null)     dmj  R       0:57      2 nid0[2318-2319]
dmj@cori11:~> srun --jobid=18219715 /bin/false
srun: error: nid02319: task 1: Exited with exit code 1
srun: Terminating job step 18219715.1
srun: error: nid02318: task 0: Exited with exit code 1
dmj@cori11:~> squeue -u dmj
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18219715 interacti    (null)     dmj  R       1:17      2 nid0[2318-2319]
dmj@cori11:~>

Is it possible that your failing sruns are not properly terminating when the
first rank crashes and are actually consuming all the requested time?
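One rough way to check, assuming the job id from your log below (2974; substitute
whatever salloc actually reported) and a placeholder test binary, is to ask sacct
how much wallclock each step really consumed and how it ended, and to put a hard
bound on each test step from srun itself:

# 2974 is taken from the log below; replace it with the real job id.
JOBID=2974

# How much wallclock did each step actually use, and how did it end?
sacct -j $JOBID --format=JobID,JobName,Elapsed,State,ExitCode

# Give each test step its own time limit (-t, in minutes) and kill the
# remaining tasks as soon as any task exits non-zero, so a wedged step
# cannot consume the rest of the allocation. ./sometest is a placeholder.
srun --jobid=$JOBID -t 10 --kill-on-bad-exit=1 -n 16 -c 4 --mpi=pmix ./sometest

If sacct shows earlier failed steps still RUNNING, or with Elapsed times of
hours rather than seconds, that would explain the allocation running out of
time.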
-Doug
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacob...@lbl.gov
------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Thu, Jan 24, 2019 at 9:24 AM Pritchard Jr., Howard <howa...@lanl.gov> wrote:

> Hello Slurm experts,
>
> We have a workflow where a script invokes salloc --no-shell and then launches
> a series of MPI jobs using srun with the --jobid= option, to make use of the
> reservation we got from the salloc invocation. We need to do things this way
> because the script itself has to report the results of the tests back to an
> external server running at AWS. The compute nodes within the allocated
> partition have no connectivity to the internet, hence our use of the
> --no-shell option.
>
> This is all fine except for an annoying behavior of Slurm. If we have no test
> failures, i.e. all srun'ed tests exit successfully, everything works fine.
> However, once we start having failed tests, and hence non-zero statuses
> returned from srun, we maybe get one or two more tests to run, and then Slurm
> cancels the reservation.
>
> Here's example output from the script as it runs some MPI tests, some of them
> fail, and then Slurm drops our reservation:
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/leakcatch
> stdout: seed value: -219475876
> stdout: 0
> stdout: 1
> stdout: 2
> stdout: 3
> stdout: 4
> stdout: 5
> stdout: 6
> stdout: 7
> stdout: 8
> stdout: 9
> stdout: 10
> stdout: 11
> stdout: 12
> stdout: 13
> stdout: 14
> stdout: 15
> stdout: 16
> stdout: 17
> stdout: 18
> stdout: 19
> stdout: 20
> stdout: ERROR: buf 778 element 749856 is 103 should be 42
> stderr: --------------------------------------------------------------------------
> stderr: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> stderr: with errorcode 16.
> stderr:
> stderr: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> stderr: You may or may not see output from other processes, depending on
> stderr: exactly when Open MPI kills them.
> stderr: --------------------------------------------------------------------------
> stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> stderr: slurmstepd: error: *** STEP 2974.490 ON st03 CANCELLED AT 2019-01-22T20:02:22 ***
> stderr: srun: error: st03: task 0: Exited with exit code 16
> stderr: srun: error: st03: tasks 1-15: Killed
> ExecuteCmd done
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/maxsoak
> stderr: srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> stderr: slurmstepd: error: *** STEP 2974.491 ON st03 CANCELLED AT 2019-01-22T23:06:08 DUE TO TIME LIMIT ***
> stderr: srun: error: st03: tasks 0-15: Terminated
> ExecuteCmd done
>
> ExecuteCmd start: srun -n 16 -c 4 --mpi=pmix --jobid=2974 /users/foobar/runInAllocMTT/mtt/masterWalloc_scratch/TestGet_IBM/ompi-tests/ibm/random/op_commutative
> stderr: srun: error: Unable to allocate resources: Invalid job id specified
> ExecuteCmd done
>
> This is not due to the allocation actually being revoked for exceeding a time
> limit, even though the message says so. The job had been running only about
> 30 minutes into a 3-hour reservation. We've double-checked that: on one
> cluster which we can configure, we set the default job time limit to infinite
> and still observe the issue. But the fact that Slurm reports it as a TIME
> LIMIT cancellation may be a hint at what is going on when Slurm revokes the
> allocation.
>
> We see this on every cluster we've tried so far, so it doesn't appear to be a
> site-specific configuration issue.
>
> Any insights into how to work around or fix this problem would be appreciated.
>
> Thanks,
>
> Howard
>
> --
> Howard Pritchard
> B Schedule
> HPC-ENV
> Office 9, 2nd floor Research Park
> TA-03, Building 4200, Room 203
> Los Alamos National Laboratory
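For reference, here is a minimal sketch of the salloc --no-shell / srun --jobid
driver pattern Howard describes above; the node count, time limit, test names,
and the 2974 job id are placeholders drawn from the logs rather than the actual
MTT script:

# Grab a reservation without starting a shell on the compute nodes, so the
# driver keeps running on a login node that can still reach the internet.
salloc -N 1 -t 3:00:00 --no-shell
# salloc prints "Granted job allocation <jobid>"; assume it reported 2974.

# Launch each MPI test inside that reservation and record its status.
for test in leakcatch maxsoak op_commutative; do
    srun -n 16 -c 4 --mpi=pmix --jobid=2974 "./$test"
    echo "$test exited with status $?"
done

# Release the reservation once all results have been reported back.
scancel 2974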