Hi Chance,

Can you check your slurm.conf's TaskPlugin and TaskPluginParam, or your
cgroup settings? The tasks may not even be constrained to a group of cores.
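
For comparison, this is roughly what core confinement via cgroups looks like
in the configs (illustrative values only, not a recommendation for your site):

    # slurm.conf -- bind tasks and confine them with cgroups
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf -- make sure cores (and memory) are actually constrained
    ConstrainCores=yes
    ConstrainRAMSpace=yes

If TaskPlugin is task/none, or ConstrainCores is off, the tasks won't be
confined to the cores they were allocated.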


The 00:02:16 core-walltime seems odd, though, as you've set each job for 40
CPU minutes (20 minutes * 2 cores). Are you using a debug partition with
restricted walltimes?
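
If so, something along these lines will show the enforced limit (the
partition name here is just an example):

    # walltime limit on a single partition
    scontrol show partition debug | grep -i MaxTime

    # or time limits for all partitions at once
    sinfo -o "%P %l"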


Regards,

   Sam


________________________________
Sam Hawarden
Assistant Research Fellow
Pathology Department
Dunedin School of Medicine
sam.hawarden(at)otago.ac.nz
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chance 
Bryce Carl Nelson <chance-nel...@nau.edu>
Sent: Saturday, 22 December 2018 08:11
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job 
Arrays

Hi folks,

Calling sacct with the UserCPU field included seems to report CPU times far
above the expected values for job array indices. This is also reported by
seff. For example, executing the following job script:
________________________________________________________

#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --workdir=/scratch/cbn35/bigdata
#SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
#SBATCH --time=20:00
#SBATCH --array=1-5
#SBATCH -c2

srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s

________________________________________________________

...results in the following stats:
________________________________________________________

       JobID  ReqCPUS    UserCPU  Timelimit    Elapsed
------------ -------- ---------- ---------- ----------
15730924_5          2   02:30:14   00:20:00   00:01:08
15730924_5.+        2  00:00.004              00:01:08
15730924_5.+        2   00:00:00              00:01:09
15730924_5.0        2   02:30:14              00:01:05
15730924_1          2   02:30:48   00:20:00   00:01:08
15730924_1.+        2  00:00.013              00:01:08
15730924_1.+        2   00:00:00              00:01:09
15730924_1.0        2   02:30:48              00:01:05
15730924_2          2   02:15:52   00:20:00   00:01:07
15730924_2.+        2  00:00.007              00:01:07
15730924_2.+        2   00:00:00              00:01:07
15730924_2.0        2   02:15:52              00:01:06
15730924_3          2   02:30:20   00:20:00   00:01:08
15730924_3.+        2  00:00.010              00:01:08
15730924_3.+        2   00:00:00              00:01:09
15730924_3.0        2   02:30:20              00:01:05
15730924_4          2   02:30:26   00:20:00   00:01:08
15730924_4.+        2  00:00.006              00:01:08
15730924_4.+        2   00:00:00              00:01:09
15730924_4.0        2   02:30:25              00:01:05

________________________________________________________
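
For reference, an sacct invocation along these lines (field list chosen to
match the columns above; exact flags are illustrative) produces that view:

    sacct -j 15730924 --format=JobID,ReqCPUS,UserCPU,Timelimit,Elapsed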

This is also reported by seff, with several errors to boot:
________________________________________________________

Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, 
<DATA> line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, 
<DATA> line 624.
Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, 
<DATA> line 624.
Job ID: 15730924
Array Job ID: 15730924_5
Cluster: monsoon
User/Group: cbn35/clusterstu
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 03:19:15
CPU Efficiency: 8790.44% of 00:02:16 core-walltime
Job Wall-clock time: 00:01:08
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)

________________________________________________________
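
For what it's worth, the efficiency percentage appears to be CPU Utilized
divided by cores * elapsed, so the inflated value traces straight back to the
reported CPU time (my arithmetic, assuming that is how seff computes it):

    core-walltime = 2 cores * 00:01:08 elapsed = 136 s = 00:02:16
    efficiency    = 03:19:15 / 00:02:16 = 11955 s / 136 s = 87.9044, i.e. 8790.44%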


As far as I can tell, a two-core job with an elapsed time of around one
minute should not be reporting a CPU time of over two hours. Could this be a
configuration issue, or is it a possible bug?

More info is available on request, and any help is appreciated!
