I think you dialed the wrong number. We're the Beowulf people! Although
I'm sure we can still help you. ;)
--
Prentice
On 8/13/20 4:14 PM, Altemara, Anthony wrote:
Cheers SLURM people,
We’re seeing some intermittent job failures in our SLURM cluster, all
with the same 137 exit code. I’m having difficulty determining whether
this error code is coming from SLURM (a timeout?) or from the Linux OS
(process killed, maybe for memory).
In this example, there’s WEXITSTATUS 137 in slurmctld.log, "error:0
status 35072" in slurmd.log, and ExitCode 9:0 in the accounting log.
Does anyone have insight into how all of these correlate? I’ve spent a
significant amount of time digging through the documentation, and I
don’t see a clear explanation of how to interpret them.
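My working guess, which I’d love for someone to confirm or correct: the
35072 from slurmd looks like a raw wait(2) status word with the exit
code in the high byte, and 137 itself looks like the usual shell
convention of 128 + signal number, i.e. SIGKILL (9). A quick arithmetic
sanity check from a bash prompt (this is just my assumption, not
something I found spelled out in the docs):

$ echo $(( 35072 >> 8 ))    # high byte of the wait status
137
$ echo $(( 137 - 128 ))     # 128 + N convention for "killed by signal N"
9
$ kill -l 9
KILL

What I still can’t tell is how that relates to the 9:0 that sacct shows,
or whether the SIGKILL came from SLURM enforcing a limit or from the
kernel OOM killer.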
Example: Job: 62791
[root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
[2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
[2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
[2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
[2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
[root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
[2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
[2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
[2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
[2020-08-13T11:17:45.405] [62791.batch] done with job
[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
62791        nf-normal+     normal     (null)          0     FAILED      9:0
[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
JobID     UID     JobName     Partition  NNodes  NodeList       State   Start                End                  Timelimit
62791     847694  nf-normal+  normal     1       XXXXXXXXXXX.+  FAILED  2020-08-13T10:58:29  2020-08-13T11:17:45  UNLIMITED
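If it would help with the memory theory, I can also pull per-step memory
numbers out of accounting, along these lines (field names taken from the
sacct man page; I’m not certain this is the definitive way to spot an
OOM kill):

$ sacct -j 62791 -o JobID,JobName,State,ExitCode,MaxRSS,ReqMem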
Thank you!
Anthony
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf