Hi,
As far as I know, exit codes 141 and 13 refer to the same thing: the signal
number plus 128 gives the exit code (13 + 128 = 141). See:
https://slurm-dev.schedmd.narkive.com/MYGH56EW/job-exit-codes
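As a rough illustration of that arithmetic, here is a minimal bash sketch. It assumes the usual shell convention that a process killed by signal N is reported with exit code 128 + N; decode_exit is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Sketch only: decode a shell-style exit code, assuming the common
# "128 + signal number" convention. decode_exit is a hypothetical helper.
decode_exit() {
    local code=$1
    if [ "$code" -gt 128 ]; then
        local sig=$((code - 128))
        # 'kill -l <number>' prints the name of that signal, if it is valid.
        local name
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "signal $sig ($name)"
    else
        echo "exit $code"
    fi
}

decode_exit 141   # on Linux, signal 13 is SIGPIPE
```

Under this reading, 141 and 13 are two views of the same number, one with the 128 offset and one without.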
Ahmet M.
On 23.11.2018 14:36, Matthew Goulden wrote:
A confirmation re-run gave the same result, but the expected exit code was
available using scontrol:
$ scontrol show job 197
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=141:0
sacct still reports 13:0, as before:
$ sacct -j 197
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
197          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
197.batch         batch                   slt          1     FAILED     13:0
Matt
------------------------------------------------------------------------
*From:* Matthew Goulden
*Sent:* Friday, November 23, 2018 11:21 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* new user; ExitCode reporting
Hi All,
I'm a new user migrating from UGE/SGE, and I'm baffled by the ExitCode
recorded in slurmdb. I'm not sure whether this is a 'new user' issue or a
bug, so I'm raising it here first.
I'm running simple sbatch scripts with these relevant headers:
#!/bin/bash
#SBATCH --mail-user <me>@<work>
#SBATCH --mail-type END
#SBATCH -J T_113491_<redacted>_20150522
The sbatch script calls various tools and, as its final step, a
'completion_reporter' bash script that reports whether all calls ran to
completion. If they did not, the return code from that script is passed to
an exit command in the sbatch script, so the expectation is that the sbatch
script's return code in these circumstances is the one from
'completion_reporter'. That return code is 141.
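For readers following along, the arrangement described above might look roughly like the sketch below. The real scripts are not shown in this thread, so everything here is an assumption: completion_reporter is replaced by a stand-in function that always fails with 141, and the batch script body is wrapped in a hypothetical batch_body function so it can be run outside Slurm.

```shell
#!/bin/bash
# Hypothetical stand-in for the real completion_reporter script, which is
# not shown in the thread; here it simply fails with return code 141.
completion_reporter() {
    return 141
}

# Body of the batch script, wrapped in a function so it can be exercised
# outside Slurm. The subshell body keeps 'exit' from ending the caller.
batch_body() (
    # ... calls to the various tools would go here ...
    rc=0
    completion_reporter || rc=$?
    if [ "$rc" -ne 0 ]; then
        # Pass the reporter's return code on as the script's exit code.
        exit "$rc"
    fi
)

status=0
batch_body || status=$?
echo "batch script exit code: $status"
```

In this sketch the script ends with a plain exit 141 rather than being killed by a signal, which is why 141 is the value expected in the accounting record.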
GOOD
The emails received have a subject line consistent with expectations:
'Slurm Job_id=196 Name=T_113491_<redacted>_20150522 Ended, Run time
00:00:24, FAILED, ExitCode 141'
UNEXPECTED
However sacct output is not consistent with expectations...
$ sacct -j 196
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
196          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
196.batch         batch                   slt          1     FAILED     13:0
I've spent some time reading through the (excellent, frankly)
documentation for sbatch and job_exit_code and, while I've learned a great
deal, nothing has explained this anomaly.
Incidentally I expected to be able to use scontrol as below; any
pointers on the unexpected outcome would be welcome
$ scontrol show step 196.batch
Job step 196.0 not found
We have put a fair bit of work into making our failure exit codes
informative, so suggestions as to what's going on here would be welcome.
Thanks in advance
Matt