Problem with cpu time

Uwe Bolick Mon, 17 Sep 2012 07:53:21 -0700

Hi,

We have observed strange behaviour on our compute nodes, both after
upgrading to squeeze and on new nodes freshly installed with squeeze.


Every process that runs longer than 24.8 days ends up reporting a
"nonsense" cpu time. Below is an example of the output of
"ps -u username f" over time:

[2012-05-29 05:49:33] 30590 ?        R    35793:27 ./fortran_kdis 2 25
[2012-05-29 05:49:38] 30590 ?        R    35793:32 ./fortran_kdis 2 25
[2012-05-29 05:49:43] 30590 ?        R    35793:37 ./fortran_kdis 2 25
[2012-05-29 05:49:48] 30590 ?        R    11129636:45 ./fortran_kdis 2 25
[2012-05-29 05:49:53] 30590 ?        R    11129636:45 ./fortran_kdis 2 25
[2012-05-29 05:49:58] 30590 ?        R    11129636:45 ./fortran_kdis 2 25
[2012-05-29 11:20:36] 30590 ?        R    11129636:45 ./fortran_kdis 2 25

Even several days later, the accumulated cpu time is still stuck at
that value.
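
To put the numbers above into perspective, here is a minimal sketch
that converts the ps TIME column to seconds and days. It assumes the
column is in the "MMM:SS" form shown in the listing; the function name
is only illustrative. The last line prints 2**31 milliseconds in days
purely for comparison; whether that has anything to do with the
problem is not established here.

```python
SECONDS_PER_DAY = 86400


def ps_time_to_seconds(field):
    """Convert a ps TIME field of the assumed form 'MMM:SS' to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + int(seconds)


# Values copied from the ps listing above (PID 30590).
last_sane = ps_time_to_seconds("35793:37")
after_jump = ps_time_to_seconds("11129636:45")

print("last sane value :", last_sane, "s =", last_sane / SECONDS_PER_DAY, "days")
print("after the jump  :", after_jump, "s =", after_jump / SECONDS_PER_DAY, "days")

# For comparison only: 2**31 milliseconds expressed in days.
print("2**31 ms        :", 2**31 / 1000 / SECONDS_PER_DAY, "days")
```

The last sane value comes out at roughly 24.86 days, which matches the
24.8 days mentioned above.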

The daily report of the "Grid Engine 2011.11" job scheduler for this
job shows:

...
...:86412.030000:6925734.008546:...
...:86380.140000:6923160.882762:...
...:86423.790000:6926644.923450:...
...:30016546.230000:2405779823.509468:...  <---- day with jump
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000::...
...:0.000000:0.000000:...
...:17112.340000:1371446.414438:...
...:86395.480000:6924655.520745:...
...:86411.810000:6926306.216313:...
...:86415.170000:6926575.536817:...
...:85071.220000:6818939.616130:...
...

The two numbers between the "..." are the values of "ru_utime" and
"ru_stime". The ru_utime values for the days before the "jump" are
correct, but afterwards they are nonsense for some days and then OK
again (this job was running with 100% cpu usage the whole time!).
All ru_stime values, however, look strange. Keep in mind:
1 day == 86400 sec.
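
The check described above can be sketched as follows. This is only an
illustration: it assumes the job uses a single core, so one day of
ru_utime should be close to 86400 sec, and the 2% tolerance and the
label names are arbitrary choices. The values are copied from the
report excerpt; the elided fields are left out.

```python
SECONDS_PER_DAY = 86400.0
TOLERANCE = 0.02  # assumed slack for daily reporting jitter

# (ru_utime, ru_stime) per daily report line, in the order shown above.
daily = [
    (86412.03, 6925734.008546),
    (86380.14, 6923160.882762),
    (86423.79, 6926644.923450),
    (30016546.23, 2405779823.509468),  # day with jump
    (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
    (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
    (17112.34, 1371446.414438),
    (86395.48, 6924655.520745),
    (86411.81, 6926306.216313),
    (86415.17, 6926575.536817),
    (85071.22, 6818939.616130),
]


def classify(ru_utime):
    """Label one day's ru_utime for a job assumed to use one core."""
    if ru_utime == 0.0:
        return "zero"
    if ru_utime > SECONDS_PER_DAY * (1 + TOLERANCE):
        return "too_large"
    if ru_utime < SECONDS_PER_DAY * (1 - TOLERANCE):
        return "low"
    return "ok"


labels = [classify(utime) for utime, _ in daily]

for (utime, stime), label in zip(daily, labels):
    # The stime/utime ratio is printed so any regularity becomes visible.
    ratio = stime / utime if utime else float("nan")
    print(f"{utime:16.2f} {stime:20.6f} {label:10s} ratio={ratio:.3f}")
```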

In addition, for all jobs that show this behaviour after 35793:37,
the accumulated cpu time they end up with differs from job to job:

[2012-05-29 05:18:32] 30591 ?        R    10557290:44 ./fortran_kdis 2 27
[2012-05-29 05:34:42] 30636 ?        R    11129626:19 ./fortran_kdis 2 31
[2012-05-29 05:58:20] 30637 ?        R    12274089:59 ./fortran_kdis 2 30
[2012-05-29 06:02:37] 30630 ?        R    12274256:17 ./fortran_kdis 2 28
[2012-05-29 06:03:12] 30634 ?        R    11129641:38 ./fortran_kdis 2 29
[2012-05-29 06:09:44] 30638 ?        R    12274280:17 ./fortran_kdis 2 32
[2012-05-29 06:23:55] 30587 ?        R    11701990:44 ./fortran_kdis 2 26
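
To see whether such values come from ps itself or from the kernel
counters, one could read utime and stime (fields 14 and 15) directly
from /proc/<pid>/stat. The following is a hedged sketch, not something
we have run on the affected nodes; the function names are illustrative.

```python
import os


def parse_stat_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # after the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 is rest[11], field 15 is rest[12].
    return int(rest[11]), int(rest[12])


def cpu_seconds(pid):
    """Return (user, system) cpu time in seconds for a running process."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as fh:
        utime, stime = parse_stat_cpu_ticks(fh.read())
    return utime / ticks_per_second, stime / ticks_per_second
```

Calling cpu_seconds(30590) on the node running that job would give the
raw counters to compare against the ps output.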

Kernel and architecture in use:
# uname -a
Linux warg09 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux

Any help in getting rid of this issue would be highly appreciated.

Thanks in advance...

-- 
 Uwe Bolick
 Zentrum für Astronomie und Astrophysik
 Technische Universität Berlin
 EW 8-1, Hardenbergstr. 36, D-10623 Berlin (Germany)


