Perhaps run srun with -vvv to get maximum verbosity as srun works through the job.
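For example, something like the following; the node/task counts and test binary here are placeholders for whatever your nightly scripts actually run, and --immediate=120 is the flag you mention below:

    srun -vvv --immediate=120 -N 4 -n 256 ./nightly_test

At maximum verbosity srun logs its launch, signal, and task-exit handling, which should at least make it easier to see on which side the termination originates.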
Doug

On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <andy.ri...@hpe.com> wrote:

> Hi All,
>
> Just checking to see if this sounds familiar to anyone.
>
> Environment:
> - CentOS 7.5 x86_64
> - Slurm 17.11.10 (but this also happened with 17.11.5)
>
> We typically run about 100 tests/night, selected from a handful of
> favorites. For roughly 1 in 300 test runs, we see one of two mysterious
> failures:
>
> 1. The 5-minute cancellation
>
> A job will be rolling along, generating its expected output, and then
> this message appears:
>
> srun: forcing job termination
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
> srun: error: nodename: task 250: Terminated
> srun: Terminating job step 3531.0
>
> sacct reports
>
> JobID        Start               End                 ExitCode State
> ------------ ------------------- ------------------- -------- ----------
> 3418         2019-01-29T05:54:07 2019-01-29T05:59:16 0:9      FAILED
>
> When they happen, these failures consistently occur just about 5 minutes
> into the run.
>
> 2. The random cancellation
>
> As above, a job will be generating the expected output, and then we see
>
> srun: forcing job termination
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
> srun: error: nodename: task 250: Terminated
> srun: Terminating job step 3531.0
>
> But this time, sacct reports
>
> JobID        Start               End                 ExitCode State
> ------------ ------------------- ------------------- -------- ----------
> 3531         2019-01-30T07:21:25 2019-01-30T07:35:50 0:0      COMPLETED
> 3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56 0:15     CANCELLED
>
> I think we've seen these cancellations pop up as soon as a minute or two
> into the test run, and as late as perhaps 20 minutes in.
>
> The only thing slightly unusual in our job submissions is that we use
> srun's "--immediate=120" so that the scripts can respond appropriately
> if a node goes down.
>
> With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in
> the slurmctld or slurmd logs.
>
> Any thoughts on what might be happening, or what I might try next?
>
> Andy
>
> --
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!
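P.S. When triaging a suspect job, it may also help to pull exactly those sacct fields directly, e.g. (3531 is the job ID from the second example above):

    sacct -j 3531 --format=JobID,Start,End,ExitCode,State

Comparing the job record with the 3531.0 step record is what separates the two failure modes quoted above.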