Hi, Sean:
Slurm version 20.02.6 (via Bright Cluster Manager)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared
I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job
appeared to have left two slurmstepd zombie processes.
What are your Slurm settings - what are the values of
ProctrackType
JobAcctGatherType
JobAcctGatherParams
and what's the contents of cgroup.conf? Also, what version of Slurm are you
using?
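A quick way to gather all of that in one go is to query the running daemons
rather than read slurm.conf by hand; the cgroup.conf path below is only the
usual default and may sit elsewhere on a Bright-managed cluster:

# Effective values as seen by slurmctld/slurmd
scontrol show config | grep -Ei 'ProctrackType|JobAcctGatherType|JobAcctGatherParams'
# cgroup.conf normally lives alongside slurm.conf (path is an assumption)
cat /etc/slurm/cgroup.conf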
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Ser
Hi Dave,
Hope you're doing well.
(...very possible you have already done these things...)
Maybe the logs on the compute node (system and slurmd.log) would yield more
info?
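For example, something along these lines on the compute node, with the job ID
taken from the seff output elsewhere in the thread (log paths vary by site;
these are common defaults):

# Anything slurmd logged about the job
grep -i 83387 /var/log/slurmd.log
# Kernel/syslog complaints around the time the job ended
grep -iE 'oom|out of memory|killed process' /var/log/messages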
Rolling the dice, it may also be worth looking for runaway processes or jobs
on that compute node, as well as confirming the node is
One possible datapoint: on the node where the job ran, there were two
slurmstepd processes running, both at 100% CPU even after the job had ended.
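A rough way to inspect those leftover steps, if they are still there (the
cgroup path is an assumption for a cgroup-v1 setup with proctrack/cgroup):

# State, age and CPU of any lingering step daemons
ps -eo pid,ppid,stat,etime,pcpu,cmd | grep '[s]lurmstepd'
# Check whether the job's freezer cgroup was ever cleaned up
ls -d /sys/fs/cgroup/freezer/slurm/uid_*/job_83387 2>/dev/null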
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.e
Hi Michael:
I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.
There are some "static" arrays allocated with ones() or zeros(), but those use
small subsets (< 10 columns) of the loaded data, and outputs are arrays of
6x10. Certainly there are not 16e9 rows in the original data.
Here's seff output, if it makes any difference. In any case, the exact same job
was run by the user on their laptop with 16 GB RAM with no problem.
Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Effici
Just a starting guess, but are you certain the MATLAB script didn’t try to
allocate enormous amounts of memory for variables? That’d be about 16e9
floating point values, if I did the units correctly.
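Unpacking that figure, assuming 8-byte doubles (the byte arithmetic is mine;
only the 16e9 count comes from the message above):

# 16e9 double-precision values at 8 bytes each
echo $(( 16 * 10**9 * 8 / 2**30 ))   # ~119 GiB, i.e. on the order of 128 GB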
On Mar 15, 2021, at 12:53 PM, Chin,David wrote:
One should keep in mind that sacct results for memory usage are not
accurate for Out Of Memory (OoM) jobs. This is because the job is typically
terminated before the next sacct polling period, and also terminated before
it reaches its full memory allocation. Thus I wouldn't
trust an
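In that situation the kernel log on the compute node is usually a better
witness than the accounting tables; a sketch, assuming the kill was done by
the kernel OOM killer (the date is taken from the quoted message header):

# On the node that ran the job
dmesg -T | grep -iE 'out of memory|oom-kill'
# or, on systemd hosts
journalctl -k --since '2021-03-15' | grep -iE 'out of memory|oom-kill'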
Hi, all:
I'm trying to understand why a job exited with an error condition. I think it
was actually terminated by Slurm: the job was a Matlab script, and its output
was incomplete.
Here's sacct output:
JobID  JobName  User  Partition  NodeList  Elapsed  State  ExitCode
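A fuller query of the same record, with the fields that separate a Slurm
kill from an application failure (job ID taken from the seff output in this
thread; these are all standard sacct format names):

# Per-job and per-step state, exit code and memory accounting
sacct -j 83387 -o JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed,NodeList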
Hi Paul,
Thank you for your reply. Good to know that in your case you get consistent
replies. I had done a similar analysis.
Starting with a user I got from the accounting records:
sacct -X -u rsantos --starttime=2020-01-01 --endtime=now -o
jobid,part,account,start,end,elapsed,alloctres%80 | g