Hello all,

> A closer look revealed that jag_common_poll_data isn't consistent in
> logging the total memory usage, depending on the application in
> question.

Ugh. I'm embarrassed to say that the issue was tracked back to
incorrectly using proctrack/pgid instead of proctrack/linuxproc; sadly
this is an artifact from our testing days and was never changed after
our move to production! RTFM, dude! (A rough sketch of the slurm.conf
change is appended below, after the quoted message.)

John DeSantis

On Tue, 5 Sep 2017 11:40:15 -0600
John DeSantis <desan...@usf.edu> wrote:

> Hello all,
>
> We were recently alerted by a user whose long running jobs (>= 6
> days) were killed by oom.
>
> A closer look revealed that jag_common_poll_data isn't consistent in
> logging the total memory usage, depending on the application in
> question. For example, consider the following job submission:
>
> #!/bin/bash
> #
> #SBATCH --job-name=sif14_1co2
> #SBATCH --output=runlog.log
> #SBATCH --time=48:00:00
> #SBATCH --nodes=8 --ntasks-per-node=6
> #SBATCH --partition=p2016 --qos=p16
> #SBATCH --mail-type=END
> #SBATCH --mail-user=user
>
> module purge
> module add apps/vasp/5.4.1
>
> # For standard vasp
> mpirun vasp_std
>
> Since no memory syntax was offered, the DefMemPerCPU value would take
> over, which is 512 MB.
>
> Looking at the batch host stats, here is what is actually being
> reported:
>
>  PPID   LWP PSR     TIME S     RSS  ELAPSED COMMAND           VSZ
> 28388 28394   1 00:00:00 S    1332 17:47:24 slurm_script   106116
> 28394 28403   1 00:00:00 S    1404 17:47:24 mpirun         106248
> 28403 28416  10 00:00:00 S    1664 17:47:24 mpiexec.hydra   17180
> 28416 28417   5 00:00:00 S    4996 17:47:24 srun           708424
> 28417 28421   3 00:00:00 S     668 17:47:24 srun            37712
> 28453 28458   0 00:00:00 S    1928 17:47:24 pmi_proxy       16384
> 28458 28471   4 15:55:13 R 1059004 17:47:24 vasp_std      2953636
> 28458 28472  10 17:01:43 R 1057672 17:47:24 vasp_std      2950168
> 28458 28473  20 17:35:54 R 1055412 17:47:24 vasp_std      2947896
> 28458 28474   3 17:53:39 R 1060060 17:47:24 vasp_std      2952712
> 28458 28475  11 17:51:52 R 1055168 17:47:24 vasp_std      2947772
> 28458 28476  23 17:53:42 R 1063888 17:47:24 vasp_std      2956572
>
> [2017-09-05T13:11:09.124] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:09.280] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:31.729] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:31.757] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
> [2017-09-05T13:11:39.126] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
> [2017-09-05T13:11:39.283] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
>
> However, if a slightly older version of the same software is used,
> memory seems to be reported correctly. Consider the submission
> script below.
>
> #!/bin/sh
>
> #SBATCH --time=04:00:00
> #SBATCH -N 2 --ntasks-per-node=2 -J "vasp_run"
> #SBATCH --mem-per-cpu=1800
> #SBATCH --reservation=blah
> #SBATCH -p blah
> #SBATCH -w nodes-[1-2]
>
> module purge
> module load apps/vasp/5.3.3
>
> mpirun vasp
>
> And, the reported stats:
>
>  PPID   LWP PSR     TIME S     RSS ELAPSED COMMAND
> 22046 22050   0 00:00:00 S    1324   03:32 slurm_script
> 22050 22053   0 00:00:00 S    1476   03:32 mpirun
> 22053 22111   1 00:00:00 S    1428   03:32 mpiexec.hydra
> 22111 22112   2 00:00:00 S    4800   03:32 srun
> 22112 22113   3 00:00:00 S     668   03:32 srun
> 22121 22126   3 00:00:00 S    1352   03:32 pmi_proxy
> 22126 22127   5 00:03:30 R 1781544   03:32 vasp
> 22126 22128   0 00:03:30 R 1778052   03:32 vasp
>
> [2017-09-05T13:17:58.097] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:17:58.265] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2527 pid 22126 mem size 3538576 7561020 time 59.630000(56+2)
> [2017-09-05T13:18:28.101] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:28.270] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2505 pid 22126 mem size 3560928 7655316 time 119.590000(116+3)
> [2017-09-05T13:18:58.105] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:18:58.275] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2498 pid 22126 mem size 3560932 7655316 time 179.570000(175+3)
> [2017-09-05T13:19:28.109] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:28.281] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2495 pid 22126 mem size 3560932 7655316 time 239.550000(234+4)
> [2017-09-05T13:19:58.113] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
> [2017-09-05T13:19:58.286] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2493 pid 22126 mem size 3560936 7655316 time 299.510000(294+5)
>
> Has anyone else noticed anything similar on their cluster(s)? I
> cannot confirm if this was happening before we upgraded from 15.08.4
> to 16.05.10-2.
>
> Thanks,
> John DeSantis
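
P.S. for anyone finding this thread in the archives: the correction
amounts to switching the process-tracking plugin in slurm.conf and
restarting the slurmd daemons. The excerpt below is only a rough
sketch: proctrack/linuxproc is the plugin named above (proctrack/cgroup
would also be a reasonable choice, depending on site policy), and the
jobacct_gather/linux line is purely illustrative, included because
jag_common_poll_data() belongs to the jobacct_gather plugin code.

  # slurm.conf (excerpt)
  #
  # proctrack/pgid only follows the step's process group, so processes
  # that mpirun/pmi_proxy re-parent (e.g. the vasp_std ranks above) can
  # presumably escape the accounting poll.
  #ProctrackType=proctrack/pgid        # what we had been running
  ProctrackType=proctrack/linuxproc    # scans /proc to find all step processes

  # Illustrative only; jag_common_poll_data() is driven by the
  # configured jobacct_gather plugin.
  JobAcctGatherType=jobacct_gather/linux

After restarting the daemons, the active settings can be confirmed
with:

  scontrol show config | grep -E 'ProctrackType|JobAcctGather'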