Actually, we double-checked and are seeing it in normal jobs too.

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/4/19, 9:24 AM, "slurm-users on behalf of Paddy Doyle" 
<slurm-users-boun...@lists.schedmd.com on behalf of pa...@tchpc.tcd.ie> wrote:

    Hi Chris,
    
    We're seeing it on 18.08.3 (recently upgraded from 17.02), so I was hoping
    that it was fixed in 18.08.4. Note that we're seeing it in regular jobs
    (haven't tested job arrays).
    
    I think it's cgroups-related; there's a similar bug here:
    
    
    https://bugs.schedmd.com/show_bug.cgi?id=6095
    
    I was hoping that this note in the 18.08.4 NEWS might have been related:
    
    -- Fix jobacct_gather/cgroup to work correctly when more than one task is
       started on a node.
    
    Thanks,
    Paddy
    
    On Fri, Jan 04, 2019 at 03:19:18PM +0000, Christopher Benjamin Coffey wrote:
    
    > I'm surprised no one else is seeing this issue. If you have 18.08, could you
    > take a moment and run jobeff on a job in one of your users' job arrays? I'm
    > guessing jobeff will show the same issue we are seeing: usercpu is
    > incorrect, and off by many orders of magnitude.
    > 
    > Best,
    > Chris
    > 
    > --
    > Christopher Coffey
    > High-Performance Computing
    > Northern Arizona University
    > 928-523-1167
    >  
    > 
    > On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" <chris.cof...@nau.edu> wrote:
    > 
    >     So this issue is occurring only with job arrays.
    >     
    >     --
    >     Christopher Coffey
    >     High-Performance Computing
    >     Northern Arizona University
    >     928-523-1167
    >      
    >     
    >     On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson"
    >     <slurm-users-boun...@lists.schedmd.com on behalf of chance-nel...@nau.edu> wrote:
    >     
    >         Hi folks,
    >         
    >         
    >         Calling sacct with the usercpu field in the output format seems to
    >         report CPU times far above the expected values for job array indices.
    >         seff reports the same. For example, executing the following job script:
    >         ________________________________________________________
    >         
    >         
    >         #!/bin/bash
    >         #SBATCH --job-name=array_test                   
    >         #SBATCH --workdir=/scratch/cbn35/bigdata          
    >         #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
    >         #SBATCH --time=20:00  
    >         #SBATCH --array=1-5
    >         #SBATCH -c2
    >         
    >         
    >         srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
    >         
    >         
    >         
    >         ________________________________________________________
    >         
    >         
    >         ...results in the following stats:
    >         ________________________________________________________
    >         
    >         
    >         
    >                JobID  ReqCPUS    UserCPU  Timelimit    Elapsed 
    >         ------------ -------- ---------- ---------- ---------- 
    >         15730924_5          2   02:30:14   00:20:00   00:01:08 
    >         15730924_5.+        2  00:00.004              00:01:08 
    >         15730924_5.+        2   00:00:00              00:01:09 
    >         15730924_5.0        2   02:30:14              00:01:05 
    >         15730924_1          2   02:30:48   00:20:00   00:01:08 
    >         15730924_1.+        2  00:00.013              00:01:08 
    >         15730924_1.+        2   00:00:00              00:01:09 
    >         15730924_1.0        2   02:30:48              00:01:05 
    >         15730924_2          2   02:15:52   00:20:00   00:01:07 
    >         15730924_2.+        2  00:00.007              00:01:07 
    >         15730924_2.+        2   00:00:00              00:01:07 
    >         15730924_2.0        2   02:15:52              00:01:06 
    >         15730924_3          2   02:30:20   00:20:00   00:01:08 
    >         15730924_3.+        2  00:00.010              00:01:08 
    >         15730924_3.+        2   00:00:00              00:01:09 
    >         15730924_3.0        2   02:30:20              00:01:05 
    >         15730924_4          2   02:30:26   00:20:00   00:01:08 
    >         15730924_4.+        2  00:00.006              00:01:08 
    >         15730924_4.+        2   00:00:00              00:01:09 
    >         15730924_4.0        2   02:30:25              00:01:05 
    >         
    >         
    >         
    >         ________________________________________________________
    >         
    >         
    >         This is also reported by seff, with several errors to boot:
    >         ________________________________________________________
    >         
    >         
    >         
    >         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
    >         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
    >         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
    >         Job ID: 15730924
    >         Array Job ID: 15730924_5
    >         Cluster: monsoon
    >         User/Group: cbn35/clusterstu
    >         State: COMPLETED (exit code 0)
    >         Nodes: 1
    >         Cores per node: 2
    >         CPU Utilized: 03:19:15
    >         CPU Efficiency: 8790.44% of 00:02:16 core-walltime
    >         Job Wall-clock time: 00:01:08
    >         Memory Utilized: 0.00 MB (estimated maximum)
    >         Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)
    >         
    >         
    >         
    >         ________________________________________________________
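
As a cross-check, the efficiency figure in the seff output above appears to be simply CPU Utilized divided by core-walltime. The following is a minimal Python sketch of that arithmetic, not seff's actual code; the helper names `hms_to_seconds` and `cpu_efficiency` are hypothetical, and the parser assumes plain HH:MM:SS strings as printed above.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string, as printed by seff above, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, core_walltime):
    """Return CPU time as a percentage of core-walltime (hypothetical helper)."""
    return 100.0 * hms_to_seconds(cpu_utilized) / hms_to_seconds(core_walltime)


# Values copied from the seff output above:
# core-walltime 00:02:16 is 2 cores x 00:01:08 wall-clock time.
efficiency = cpu_efficiency("03:19:15", "00:02:16")
print(f"{efficiency:.2f}%")  # prints 8790.44%
```

The result matches the 8790.44% that seff printed, which suggests seff's percentage calculation is behaving as intended and the inflated value comes from the CPU time recorded in accounting.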
    >         
    >         
    >         
    >         
    >         
    >         As far as I can tell, a two-core job with an elapsed time of around
    >         one minute should not accumulate two hours of CPU time; the most it
    >         could use is about two minutes. Could this be a configuration issue,
    >         or is it a possible bug?
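
The reasoning above can be turned into a quick sanity check: user CPU time should not exceed ReqCPUS multiplied by Elapsed. Below is a minimal Python sketch under that assumption. The function names are hypothetical, the rows are copied by hand from the sacct table above, and the parser only covers the time shapes seen there (HH:MM:SS and MM:SS.mmm) plus an optional day prefix such as 1-02:30:14.

```python
def sacct_time_to_seconds(text):
    """Parse a time such as 02:30:14, 00:00.004 or 1-02:30:14 into seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    # Each colon-separated field is worth 60x the one to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return days * 86400 + seconds


def cpu_time_ratio(user_cpu, req_cpus, elapsed):
    """Ratio of reported user CPU time to the ReqCPUS * Elapsed upper bound."""
    bound = req_cpus * sacct_time_to_seconds(elapsed)
    return sacct_time_to_seconds(user_cpu) / bound


# (JobID, ReqCPUS, UserCPU, Elapsed), copied from the sacct table above.
rows = [
    ("15730924_5", 2, "02:30:14", "00:01:08"),
    ("15730924_1", 2, "02:30:48", "00:01:08"),
    ("15730924_2", 2, "02:15:52", "00:01:07"),
]

for job_id, req_cpus, user_cpu, elapsed in rows:
    ratio = cpu_time_ratio(user_cpu, req_cpus, elapsed)
    verdict = "SUSPECT" if ratio > 1.0 else "ok"
    print(f"{job_id}: UserCPU is {ratio:.1f}x the possible maximum ({verdict})")
```

A ratio above 1.0 means the accounting record cannot be right for that job. The sketch does not guard against an Elapsed of zero, so very short jobs would need extra handling.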
    >         
    >         
    >         More info is available on request, and any help is appreciated!
    >         
    >         
    >         
    >         
    >         
    >     
    >     
    > 
    
    -- 
    Paddy Doyle
    Trinity Centre for High Performance Computing,
    Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
    Phone: +353-1-896-3725
    
    http://www.tchpc.tcd.ie/
    
    
