One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100% CPU, even after the job had ended.
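Something along these lines (run on node001) is one way to spot them; the exact ps invocation here is just an illustration, not the command actually used:

    # list any lingering slurmstepd processes with their CPU usage and elapsed time
    ps -eo pid,ppid,user,pcpu,etime,cmd | grep '[s]lurmstepd'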
--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu           215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chin,David <dw...@drexel.edu>
Sent: Monday, March 15, 2021 13:52
To: Slurm-Users List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Hi, all:

I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete. Here's the sacct output:

           JobID    JobName      User  Partition  NodeList    Elapsed       State  ExitCode  ReqMem    MaxRSS  MaxVMSize                 AllocTRES  AllocGRE
    ------------  ---------  --------  ---------  --------  ---------  ----------  --------  ------  --------  ---------  ------------------------  --------
           83387  ProdEmisI+     foob        def   node001   03:34:26  OUT_OF_ME+     0:125   128Gn                       billing=16,cpu=16,node=1
     83387.batch      batch                        node001   03:34:26  OUT_OF_ME+     0:125   128Gn  1617705K   7880672K  cpu=16,mem=0,node=1
    83387.extern     extern                        node001   03:34:26   COMPLETED       0:0   128Gn      460K    153196K  billing=16,cpu=16,node=1

Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu           215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki   github:prehensilecode
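For reference, output with the columns shown above would come from an sacct invocation roughly like the following; the exact format string is a guess reconstructed from the column headers, not the original command:

    # query accounting for job 83387 with the fields shown in the table above
    sacct -j 83387 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocTRES,AllocGRE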