Hi, Sean:
Slurm version 20.02.6 (via Bright Cluster Manager)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared
I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job
appeared to have left two slurmstepd zombie processes.
What are your Slurm settings - what are the values of
ProctrackType
JobAcctGatherType
JobAcctGatherParams
and what's the contents of cgroup.conf? Also, what version of Slurm are you
using?
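A quick way to gather all of that in one go is to query the running daemons
rather than read slurm.conf by hand; the cgroup.conf path below is only the
usual default and may sit elsewhere on a Bright-managed cluster:

# Effective values as seen by slurmctld/slurmd
scontrol show config | grep -Ei 'ProctrackType|JobAcctGatherType|JobAcctGatherParams'
# cgroup.conf normally lives alongside slurm.conf (path is an assumption)
cat /etc/slurm/cgroup.conf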
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Ser
Hi Dave,
Hope you're doing well.
(...very possible you have already done these things...)
Maybe the logs on the compute node (system and slurmd.log) would yield more
info?
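For example, something along these lines on the compute node, with the job ID
taken from the seff output elsewhere in the thread (log paths vary by site;
these are common defaults):

# Anything slurmd logged about the job
grep -i 83387 /var/log/slurmd.log
# Kernel/syslog complaints around the time the job ended
grep -iE 'oom|out of memory|killed process' /var/log/messages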
Rolling the dice, it may also be worth looking for runaway processes or jobs
on that compute node, as well as confirming the node is
One possible datapoint: on the node where the job ran, there were two
slurmstepd processes running, both at 100% CPU even after the job had ended.
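A rough way to inspect those leftover steps, if they are still there (the
cgroup path is an assumption for a cgroup-v1 setup with proctrack/cgroup):

# State, age and CPU of any lingering step daemons
ps -eo pid,ppid,stat,etime,pcpu,cmd | grep '[s]lurmstepd'
# Check whether the job's freezer cgroup was ever cleaned up
ls -d /sys/fs/cgroup/freezer/slurm/uid_*/job_83387 2>/dev/null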
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.e
Hi Michael:
I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.
There are some "static" arrays allocated with ones() or zeros(), but those use
small subsets (< 10 columns) of the loaded data, and outputs are arrays of
6x10. Certainly there are not 16e9 rows in the original data.
Here's seff output, if it makes any difference. In any case, the exact same job
was run by the user on their laptop with 16 GB RAM with no problem.
Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Effici
Just a starting guess, but are you certain the MATLAB script didn’t try to
allocate enormous amounts of memory for variables? That’d be about 16e9
floating point values, if I did the units correctly.
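Unpacking that figure, assuming 8-byte doubles (the byte arithmetic is mine;
only the 16e9 count comes from the message above):

# 16e9 double-precision values at 8 bytes each
echo $(( 16 * 10**9 * 8 / 2**30 ))   # ~119 GiB, i.e. on the order of 128 GB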
On Mar 15, 2021, at 12:53 PM, Chin,David wrote:
One should keep in mind that sacct results for memory usage are not
accurate for Out Of Memory (OoM) jobs. This is because the job is typically
terminated before the next sacct polling period, and also terminated before
it reaches its full memory allocation. Thus I wouldn't
trust an
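In that situation the kernel log on the compute node is usually a better
witness than the accounting tables; a sketch, assuming the kill was done by
the kernel OOM killer (the date is taken from the quoted message header):

# On the node that ran the job
dmesg -T | grep -iE 'out of memory|oom-kill'
# or, on systemd hosts
journalctl -k --since '2021-03-15' | grep -iE 'out of memory|oom-kill'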
Hi, all:
I'm trying to understand why a job exited with an error condition. I think it
was actually terminated by Slurm: the job was a Matlab script, and its output
was incomplete.
Here's sacct output:
JobID  JobName  User  Partition  NodeList  Elapsed  State  ExitCode
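A fuller query of the same record, with the fields that separate a Slurm
kill from an application failure (job ID taken from the seff output in this
thread; these are all standard sacct format names):

# Per-job and per-step state, exit code and memory accounting
sacct -j 83387 -o JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed,NodeList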
Hi Paul,
Thank you for your reply. Good to know that in your case you get consistent
replies. I had done a similar analysis.
Starting with a user I got from the accounting records:
sacct -X -u rsantos --starttime=2020-01-01 --endtime=now -o
jobid,part,account,start,end,elapsed,alloctres%80 | g