Hmm, apparently math is hard today. I of course meant 2^7, not 2^8:
137 = 128 + 9 = 2^7 + 9.

On Thu, Aug 13, 2020 at 02:37:46PM -0700, Skylar Thompson wrote:
> I think this is an artifact of the job process running as a child process
> of the job script, where POSIX defines the low-order 8 bits of the process
> exit code as indicating which signal the child process received when it
> exited.
>
> As others noted, 137 is 2^8+9, where 9 is SIGKILL (exceeding memory, also
> exceeding the runtime request, at least in the Grid Engine world).
>
> On Thu, Aug 13, 2020 at 02:24:49PM -0700, Alex Chekholko via Beowulf wrote:
> > This may be a "cargo cult" answer from old SGE days, but IIRC "137" was
> > "128+9", and it means the process got signal 9, which means _something_
> > sent it a SIGKILL.
> >
> > On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf <
> > beowulf@beowulf.org> wrote:
> >
> > > I think you dialed the wrong number. We're the Beowulf people!
> > > Although I'm sure we can still help you. ;)
> > >
> > > --
> > > Prentice
> > >
> > > On 8/13/20 4:14 PM, Altemara, Anthony wrote:
> > >
> > > Cheers SLURM people,
> > >
> > > We're seeing some intermittent job failures in our SLURM cluster, all
> > > with the same 137 exit code. I'm having difficulty determining whether
> > > this error code is coming from SLURM (timeout?) or the Linux OS
> > > (process killed, maybe memory).
> > >
> > > In this example, there's the WEXITSTATUS in the slurmctld.log,
> > > "error:0 status 35072" in the slurmd.log, and ExitCode 9:0 in the
> > > accounting log...?
> > >
> > > Does anyone have insight into how all of these correlate? I've spent a
> > > significant amount of time digging through the documentation, and I
> > > don't see a clear way to interpret them.
> > >
> > > Example: Job 62791
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
> > > [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
> > > [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
> > > [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
> > > [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
> > > [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
> > > [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
> > > [2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
> > > [2020-08-13T11:17:45.405] [62791.batch] done with job
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
> > >        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> > > ------------ ---------- ---------- ---------- ---------- ---------- --------
> > > 62791        nf-normal+     normal     (null)          0     FAILED      9:0
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
> > > JobID  UID     JobName     Partition  NNodes  NodeList       State   Start                End                  Timelimit
> > > 62791  847694  nf-normal+  normal     1       XXXXXXXXXXX.+  FAILED  2020-08-13T10:58:29  2020-08-13T11:17:45  UNLIMITED
> > >
> > > Thank you!
> > >
> > > Anthony
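
To tie Anthony's three numbers together inline: they all look like the same
wait status at different stages of digestion. slurmd's "status 35072" appears
to be the raw status word from wait(); slurmctld's "WEXITSTATUS 137" is that
word run through the WEXITSTATUS() macro; and sacct's ExitCode is documented
as exit_code:signal, so I'd hedge on exactly where Slurm folds the 137 back
down to 9 in the accounting record, but the 9 lines up with SIGKILL either
way. Assuming the traditional status layout that the POSIX macros abstract
(exit code in the high byte, terminating signal in the low seven bits), you
can check the arithmetic straight from bash:

    $ echo $(( (35072 >> 8) & 0xFF ))   # what WEXITSTATUS() extracts
    137
    $ echo $(( 35072 & 0x7F ))          # terminating signal: 0 = none
    0

So the batch script itself exited normally, with code 137 = 128 + 9: the
shell convention for "my child was killed by signal 9". Given that the job
ran with Timelimit UNLIMITED, I'd rule out a Slurm timeout and go look for
the kernel OOM killer in dmesg on the node, or a memory limit being enforced.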
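And if you want to see the 128+N convention in action, here's a minimal
reproduction; nothing Slurm-specific about it, any box with bash will do.
The inner shell stands in for the job process and the outer shell for the
batch script:

    $ bash -c 'kill -9 $$'   # child shell SIGKILLs itself
    $ echo $?
    137

The parent reports 128 + 9 = 137 for a SIGKILLed child, which is exactly the
number that then shows up as WEXITSTATUS in slurmctld.log.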
-- 
Skylar