Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-08 Thread Paddy Doyle
A small addition: I forgot to mention our JobAcct params:

JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/cgroup

I've done a small bit of playing around on a test cluster. Changing to
'JobAcctGatherFrequency=0' (i.e. only gathering at job end) then seems to give
correct values for the job via sacct/seff.

Alternatively, setting the following also works:

JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux
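
For reference, the kind of sacct query I've been using to compare the numbers
before and after looks something like this (the fields match the output Chance
posted further down; <jobid> is just a placeholder for a real array job ID):

  # <jobid> is a placeholder; substitute one of your own array jobs
  sacct -j <jobid> --format=JobID,ReqCPUS,UserCPU,Timelimit,Elapsed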

Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version.

Is anyone else still using jobacct_gather/cgroup and are you seeing this
same issue?
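
If you're not sure what a cluster is currently running, a quick check should
be something like:

  # prints JobAcctGatherType, JobAcctGatherFrequency and related params
  scontrol show config | grep -i JobAcctGather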

Just to note: there's a big warning in the man page not to adjust the
value of JobAcctGatherType while there are any running job steps. I'm not
sure whether that means just on that node or any jobs at all. Probably
safest to schedule a downtime to change it.
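
My rough plan for doing that safely (taking the cautious reading of the
warning) would be to drain the nodes and confirm nothing is left running or
completing before touching slurm.conf, e.g. something like:

  # should return 0 before changing JobAcctGatherType
  squeue --states=RUNNING,COMPLETING --noheader | wc -l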

Paddy

On Fri, Jan 04, 2019 at 10:43:54PM +, Christopher Benjamin Coffey wrote:

> Actually we double checked and are seeing it in normal jobs too.
> 
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
> 
> On 1/4/19, 9:24 AM, "slurm-users on behalf of Paddy Doyle" 
>  wrote:
> 
> Hi Chris,
> 
> We're seeing it on 18.08.3, so I was hoping that it was fixed in 18.08.4
> (recently upgraded from 17.02 to 18.08.3). Note that we're seeing it in
> regular jobs (haven't tested job arrays).
> 
> I think it's cgroups-related; there's a similar bug here:
> 
> 
> https://bugs.schedmd.com/show_bug.cgi?id=6095
> 
> I was hoping that this note in the 18.08.4 NEWS might have been related:
> 
> -- Fix jobacct_gather/cgroup to work correctly when more than one task is
>started on a node.
> 
> Thanks,
> Paddy
> 
> On Fri, Jan 04, 2019 at 03:19:18PM +, Christopher Benjamin Coffey wrote:
> 
> > I'm surprised no one else is seeing this issue. If you have 18.08, I wonder
> > if you can take a moment and run jobeff on a job in one of your users' job
> > arrays. I'm guessing jobeff will show the same issue we are seeing. The
> > issue is that usercpu is incorrect, off by many orders of magnitude.
> > 
> > Best,
> > Chris
> > 
> > --
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >  
> > 
> > On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey"  wrote:
> > 
> > So this issue is occurring only with job arrays.
> > 
> > --
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >  
> > 
> > On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson" <chance-nel...@nau.edu> wrote:
> > 
> > Hi folks,
> > 
> > 
> > Calling sacct with the usercpu field enabled seems to report CPU times far
> > above the expected values for job array indices. This is also reported by
> > seff. For example, executing the following job script:
> > 
> > #!/bin/bash
> > #SBATCH --job-name=array_test   
> > #SBATCH --workdir=/scratch/cbn35/bigdata  
> > #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
> > #SBATCH --time=20:00  
> > #SBATCH --array=1-5
> > #SBATCH -c2
> > 
> > 
> > srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
> > 
> > ...results in the following stats:
> > 
> >     JobID          ReqCPUS     UserCPU   Timelimit     Elapsed
> >     ------------  --------  ----------  ----------  ----------
> >     15730924_5           2    02:30:14    00:20:00    00:01:08
> >     15730924_5.+         2   00:00.004                00:01:08
> >     15730924_5.+         2    00:00:00                00:01:09
> >     15730924_5.0         2    02:30:14                00:01:05
> >     15730924_1           2    02:30:48    00:20:00    00:01:08
> >     15730924_1.+

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

2019-01-08 Thread Christopher Benjamin Coffey
" Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version."

Ahh, hmm I need to dig up that recommendation as I didn't see that myself. 
We'll look into this.

Thanks Paddy!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 


[slurm-users] Slurm memory error in child process of a system() function call after malloc()

2019-01-08 Thread Péter Nagy
Dear Users,

Our FORTRAN-based code uses shell operations via either the system()
function or a call to the corresponding system subroutine. This fails under
Slurm in certain cases. For instance, input files are manipulated as
istatus=system('cp file1 file2').

When using the Slurm scheduler, system() returns -1 (or 255 if interpreted as
unsigned) and the requested shell operation is not performed whenever the
system() call follows a malloc() that allocates more than half of the memory
available to the Slurm job. Unfortunately, there is no error message in either
the error output or the Slurm output file.

Our code, via a C interface, allocates all the available memory at once with a
single malloc() and works with that allocated array for the entire runtime.
All system() calls which precede the malloc() complete correctly, and all
system() calls fail starting right after the malloc().
If less than half of the Slurm job's memory limit is allocated with malloc(),
then all system() calls work perfectly.
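
In case it helps with reproducing this: I suppose one could run the binary
under strace inside the job to see which call actually fails and with what
errno. This is only a suggestion, not something we have run yet, and
'our_binary' below is just a placeholder for our executable:

  # trace only the process-creation calls; 'our_binary' is a placeholder
  srun --pty strace -f -e trace=fork,vfork,clone ./our_binary 2> strace.log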

I have tried to set the memory limit either by --mem-per-cpu or by --mem. I
also tried --mem=0 together with --exclusive.

I have tried different clusters running Slurm versions 14.03.9 and 17.11.12,
with several otherwise well-working FORTRAN compilers, and found the same
error consistently.

Performing shell operations with system() also works perfectly on the same
node, using the full memory, when no scheduler is involved. There is also no
problem with the SGE, OAR, or Condor schedulers, irrespective of the allocated
memory size.

Our guess is that there might be a Slurm-specific setting which does not allow
forking a shell/child process if more than half of the memory limit is already
consumed by the parent job. Since system() forks the parent process, Slurm
might assume that the child needs the same amount of memory as the parent and
cancel it for exceeding the Slurm job's memory limit.
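
I am only guessing which settings could be responsible; the memory-enforcement
knobs I would look at (names vary a bit between 14.03 and 17.11, and the path
to cgroup.conf may differ per site) are roughly:

  # slurm.conf side: memory/VSZ enforcement and the accounting/task plugins
  scontrol show config | grep -Ei 'MemLimitEnforce|VSizeFactor|JobAcctGather|TaskPlugin'
  # cgroup.conf side: whether RAM/swap are constrained via cgroups
  grep -Ei 'ConstrainRAMSpace|ConstrainSwapSpace' /etc/slurm/cgroup.conf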

Unfortunately, I did not find any error messages or related error reports and
got stuck here.
Could you please suggest how we could use the memory all the way up to the
Slurm job's memory limit?

Thank you very much in advance,
Peter