This may be a "cargo cult" answer from my old SGE days, but IIRC 137 is 128 + 9: the shell reports 128 plus the signal number when a process dies on a signal, so the process got signal 9, which means _something_ sent it a SIGKILL.
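The numbers in your three logs appear to line up with that reading. Assuming the "status 35072" in slurmd.log is the raw wait(2) status (a guess on my part, but the arithmetic fits), its high byte is the exit code, and exit codes above 128 encode 128 + signal number. A quick sanity check at a shell:

    $ echo $((35072 / 256))   # high byte of the raw wait status = exit code
    137
    $ echo $((137 - 128))     # exit codes above 128 are 128 + signal number
    9
    $ kill -l 9               # name of signal 9
    KILL

So WEXITSTATUS 137 in slurmctld.log, status 35072 in slurmd.log, and the 9 in sacct's ExitCode 9:0 all look like the same SIGKILL surfacing in three places. If memory is your suspect, note that the kernel OOM killer delivers SIGKILL, so it's worth grepping the node's kernel log around 11:17 for something like "Out of memory: Killed process", e.g. dmesg -T | grep -i -e oom -e 'killed process'.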
On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf <beowulf@beowulf.org> wrote:

> I think you dialed the wrong number. We're the Beowulf people! Although,
> I'm sure we can still help you. ;)
>
> --
> Prentice
>
> On 8/13/20 4:14 PM, Altemara, Anthony wrote:
>
> > Cheers SLURM people,
> >
> > We're seeing some intermittent job failures in our SLURM cluster, all with
> > the same 137 exit code. I'm having difficulty determining whether this
> > exit code is coming from SLURM (timeout?) or the Linux OS (process killed,
> > maybe memory).
> >
> > In this example, there's the WEXITSTATUS in slurmctld.log, error:0
> > status 35072 in slurmd.log, and ExitCode 9:0 in the accounting log...?
> >
> > Does anyone have insight into how all these correlate? I've spent a
> > significant amount of time digging through the documentation, and I don't
> > see a clear way to interpret all of these.
> >
> > Example: Job 62791
> >
> > [root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
> > [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
> > [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
> > [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
> > [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
> >
> > [root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
> > [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
> > [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
> > [2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
> > [2020-08-13T11:17:45.405] [62791.batch] done with job
> >
> > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
> >        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> > ------------ ---------- ---------- ---------- ---------- ---------- --------
> > 62791        nf-normal+     normal     (null)          0     FAILED      9:0
> >
> > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
> >        JobID     UID    JobName  Partition  NNodes       NodeList   State                Start                  End  Timelimit
> > 62791         847694  nf-normal+    normal       1  XXXXXXXXXXX.+  FAILED   2020-08-13T10:58:29  2020-08-13T11:17:45  UNLIMITED
> >
> > Thank you!
> >
> > Anthony
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf