Hello:

This is a long post, and I apologize for that, but I wanted to include all
possible data related to the matter.  Thanks in advance for any help on
this.

I've run into something that I've never seen in all my years of
working with RedHat Linux.  We have 4 VALinux 1220 servers, each with dual
PIII 800MHz CPUs, 1GB RAM, and 1GB swap.  All are running VALinux's enhanced
RH 6.2.1 OS, kernel 2.2.18pre11-va2.1smp #2 SMP Thu May 10 13:31:39 PDT
2001 i686 unknown.  The kernel has been recompiled once to increase
NR_TASKS from 512 to 2560 and MAX_TASKS_PER_USER to 2048 in
/usr/src/linux/include/linux/tasks.h, but everything else was left the same.

All 4 run pretty much the same applications; there are slight
differences, but nothing that should cause this problem.  Something very
strange happens when I run `procinfo` or `sar`: boxes 2 & 4 give
ridiculous results, while boxes 1 & 3 give normal-looking results.

Boxes 2 & 4 both show 0% CPU for user, nice, system, and idle.  Example
from #2:

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       1048132      879876      168256           0      148252       53452
Swap:      1052248       94336      957912

Bootup: Fri Jul 20 19:31:18 2001    Load average: 0.24 0.18 0.17 1/528 18091

user  :   9d  0:31:58.65   0.0%  page in :  4199302  disk 1:   725731r38379141w
nice  :  10d 17:54:27.50   0.0%  page out: 44259101
system:  10d  9:32:42.64   0.0%  swap in :   167943
idle  : 479d  9:42:48.65   0.0%  swap out:    87945
uptime: 254d 18:50:58.71         context :3591061245

Note all 0% for cpu usage.  Box #4 shows similar output.
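
As a sanity check on what those percentages ought to be, the time columns
themselves can be turned into shares of the total.  The following is only a
sketch in Python, not a description of how procinfo computes its numbers;
the function names `parse_duration` and `expected_percentages` are made up
for illustration, and the input strings are copied from the box #2 output
above.

```python
import re

def parse_duration(text):
    """Convert a procinfo duration such as '9d  0:31:58.65' or
    '18:56:12.69' into seconds."""
    m = re.fullmatch(r"\s*(?:(\d+)d\s+)?(\d+):(\d+):(\d+(?:\.\d+)?)\s*", text)
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days = int(m.group(1) or 0)
    hours, minutes = int(m.group(2)), int(m.group(3))
    seconds = float(m.group(4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def expected_percentages(times):
    """Return each category's share of the summed CPU time, in percent."""
    secs = {name: parse_duration(value) for name, value in times.items()}
    total = sum(secs.values())
    return {name: 100.0 * s / total for name, s in secs.items()}

# Time columns from the box #2 procinfo output.
box2 = {
    "user": "9d  0:31:58.65",
    "nice": "10d 17:54:27.50",
    "system": "10d  9:32:42.64",
    "idle": "479d  9:42:48.65",
}

for name, pct in expected_percentages(box2).items():
    print(f"{name:6s} {pct:5.1f}%")
```

Fed the box #2 times, this puts idle at about 94%, so the 0.0% column is
not what the time columns alone would suggest.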

Here is procinfo from box #1 (#3 shows similar output)
Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       1048132      890748      157384           0      177436       80900
Swap:      1052248           0     1052248

Bootup: Thu Mar  7 04:26:28 2002    Load average: 0.25 0.28 0.27 6/455 24648

user  :      18:56:12.69   1.5%  page in :  3198992  disk 1:   943746r 8830308w
nice  :   2d 14:09:03.87   5.1%  page out: 10873687
system:   1d 16:36:38.39   3.3%  swap in :        1
idle  :  45d 16:16:29.43  90.0%  swap out:        0
uptime:  25d  8:59:12.18         context :1899728374

These CPU percentages look normal.

--------------------------------------------------------
`sar -u` and `sar -U 0|1` show similar weirdness.  Boxes 1 & 3 look normal;
boxes 2 & 4 report ridiculous numbers.

sar -u on box 1 looks OK:
12:28:11       %user     %nice   %system     %idle
12:28:11       0.71%     0.16%     5.58%    93.55%

sar -u on box 2 makes no sense:
12:27:59       %user     %nice   %system     %idle
12:27:59      45.11%    50.96%   126.21%   3887.72%

Once again box 3 looks like box 1, box 4 looks like box 2 outputs.

----------------------------------------------------------
Next I did a `strace -v sar -u > file 2>&1` on boxes 4 & 3.  I'll paste
only the last few lines below, but compared line for line, the two straces
look almost identical.

box 4

open("/proc/stat", O_RDONLY)            = 3
fstat64(0x3, 0xbfffddfc)                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
read(3, "cpu  48325711 54591314 135186636"..., 1024) = 904
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x40016000, 4096)                = 0
time([1017692879])                      = 1017692879
write(1, "Linux 2.2.18pre11-va2.1smp (hg-p"..., 170Linux 2.2.18pre11-va2.1smp (hg-prd-04.homegain.com)      04/01/02

12:27:59       %user     %nice   %system     %idle
12:27:59      45.11%    50.96%   126.21%   3887.72%


box 3

open("/proc/stat", O_RDONLY)            = 3
fstat64(0x3, 0xbfffddfc)                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
read(3, "cpu  27090298 6436488 212908388 "..., 1024) = 894
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x40016000, 4096)                = 0
time([1017692891])                      = 1017692891
write(1, "Linux 2.2.18pre11-va2.1smp (hg-p"..., 169Linux 2.2.18pre11-va2.1smp (hg-prd-03.homegain.com)      04/01/02

12:28:11       %user     %nice   %system     %idle
12:28:11       0.71%     0.16%     5.58%    93.55%


Both of these straces start at the open of /proc/stat, and they look almost
the same.  The one line `read(3, "cpu N1 N2 N3..."` returns 904 on box 4 and
894 on box 3.  There is also a slight difference in the time() value, which
is simply because the two commands were run 12 seconds apart.  Yet look at
the sar results at the very bottom: box 3 shows normal numbers, box 4 shows
crazy ones.

A similar strace comparison between box 1 and box 2 shows the same pattern.
The one line `read(3, "cpu N1 N2 N3..."` returns 904 on box 2, the same as
on box 4 (the two problem boxes).  On box 1 it returns 871, 23 less than on
box 3.

box 2
read(3, "cpu  77951409 92818862 89811526 "..., 1024) = 904

box 1
read(3, "cpu  6816459 22322346 14591024 3"..., 1024) = 871


Whatever this line indicates and whatever the = 904 or 894/871 mean, it
seems like the only clue to this whole mystery is that both box 2 and
box 4 come up with the same result, 904,  for that one line.  And both
box 2 and box 4 show crazy results for `sar -u`.
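
For reference, the number after the `=` in an strace line is the system
call's return value, and for read() that is the number of bytes read.  So
904 versus 894 says only that the text of /proc/stat was ten bytes longer
on one box, which is what happens when the counters have more digits.  A
small Python sketch of that, using temporary stand-in files instead of the
real /proc/stat (the helper names `read_once` and `bytes_read_for` are
illustrative only):

```python
import os
import tempfile

def read_once(path, size=1024):
    """Issue a single read() of up to `size` bytes, as sar does in the trace."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, size)
    finally:
        os.close(fd)

def bytes_read_for(text):
    """Write `text` to a temporary file and return how many bytes one read() yields."""
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.write(text)
        path = tmp.name
    try:
        return len(read_once(path))
    finally:
        os.unlink(path)

# Aggregate cpu lines taken from the /proc/stat dumps of boxes 3 and 4.
box3_line = "cpu  27085616 6434384 212834029 3565588659\n"
box4_line = "cpu  48325607 54578767 135160103 4163716107\n"

short = bytes_read_for(box3_line)
longer = bytes_read_for(box4_line)
print(short, longer, longer - short)
```

The byte count just tracks the length of the text, so a matching 904 on
boxes 2 and 4 indicates similarly sized counters rather than anything
about how sar interprets them.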

Does anyone know what this means, how it came to be, or what I should
do to get correct stats on the problem boxes 2 & 4?  Any help is
greatly appreciated.

=============================================================================
Mike Lee                                                Unix Systems Admin
[EMAIL PROTECTED]                                   Homegain.com

=============================================================================

Finally, here is the `cat /proc/stat` output from all 4 boxes, taken at the
time the other data were gathered:

Box #1
cpu  6815648 22265293 14560643 393652004
cpu0 2892415 10857651 7116311 197780417
cpu1 3923233 11407642 7444332 195871587
disk 9751244 0 0 0
disk_rio 943738 0 0 0
disk_wio 8807506 0 0 0
disk_rblk 6028904 0 0 0
disk_wblk 45698060 0 0 0
disk_pgin 17422726 0 0 0
disk_pgout 52094458 0 0 0
page 3198941 10845620
swap 1 0
intr 908904501 218646794 2 0 3 391 0 0 79345779 1 0 0 0 0 1 30 0 0
109510890 0 0 491982848 0 0 0 9417732 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0
ctxt 1884284700
btime 1015503988
processes 9259259


Box #2
cpu  77951038 92785486 89782075 4141054443
cpu0 38036170 42881571 43823112 2076045668
cpu1 39914868 49903915 45958963 2065008775
disk 39082420 0 0 0
disk_rio 725685 0 0 0
disk_wio 38356735 0 0 0
disk_rblk 4901232 0 0 0
disk_wblk 174330620 0 0 0
disk_pgin 15807422 0 0 0
disk_pgout 195690674 0 0 0
page 4199023 44233024
swap 167691 87945
intr 3045560553 2200786521 5190 0 3 917763 0 0 650561234 1 0 0 0 0 1 30
0 0 1106845836 0 0 3343233806 0 0 0 38177434 30 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
ctxt 3583751914
btime 995682677
processes 95573857


Box #3
cpu  27085616 6434384 212834029 3565588659
cpu0 12879380 3221849 106202624 1783667491
cpu1 14206236 3212535 106631405 1781921168
disk
0 0 0
disk_rio 543345 0 0 0
disk_wio 163810810 0 0 0
disk_rblk 3572604 0 0 0
disk_wblk 353498906 0 0 0
disk_pgin 12345368 0 0 0
disk_pgout 437197320 0 0 0
page 3515662 202296516
swap 1 0
intr 1034590436 1905971344 3 0 3 44068 0 0 764316976 1 0 0 0 0 1 30 0 0
957178180 0 0 1539652702 0 0 0 162394394 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0
ctxt 2248435364
btime 998631743
processes 66291265


Box #4
cpu  48325607 54578767 135160103 4163716107
cpu0 24034142 26921114 67556904 2082378132
cpu1 24291465 27657653 67603200 2081337974
disk 54155788 0 0 0
disk_rio 741140 0 0 0
disk_wio 53414648 0 0 0
disk_rblk 5054056 0 0 0
disk_wblk 224983504 0 0 0
disk_pgin 18623390 0 0 0
disk_pgout 250387210 0 0 0
page 4575235 60555155
swap 417212 267467
intr 2838460878 2200890292 3 0 3 968999 0 0 855746431 1 0 0 0 0 1 30 0
0 1086958781 0 0 2935979554 0 0 0 52884049 30 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
ctxt 1874175302
btime 995682511
processes 102791618
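
One thing that can be checked directly from these dumps is how large the
four aggregate `cpu` counters get when added together.  The sketch below
(Python; `parse_cpu_line` and `percentages` are illustrative names, and the
idea that a tool might keep the sum in a 32-bit unsigned variable is purely
an assumption, not checked against the sar or procinfo source) computes
each box's total and, for comparison, the percentages that would result if
that total wrapped at 2**32.

```python
MASK32 = 0xFFFFFFFF  # largest value a 32-bit unsigned integer can hold

def parse_cpu_line(line):
    """Split the aggregate 'cpu' line of /proc/stat into its four counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return {"user": user, "nice": nice, "system": system, "idle": idle}

def percentages(counters, wrap32=False):
    """Share of each counter in the total; optionally wrap the total at 2**32
    to imitate a hypothetical 32-bit sum."""
    total = sum(counters.values())
    if wrap32:
        total &= MASK32
    if total == 0:
        raise ZeroDivisionError("total is zero")
    return {k: 100.0 * v / total for k, v in counters.items()}

# Aggregate cpu lines copied from the four dumps above.
boxes = {
    1: "cpu  6815648 22265293 14560643 393652004",
    2: "cpu  77951038 92785486 89782075 4141054443",
    3: "cpu  27085616 6434384 212834029 3565588659",
    4: "cpu  48325607 54578767 135160103 4163716107",
}

for number, line in boxes.items():
    counters = parse_cpu_line(line)
    total = sum(counters.values())
    status = "exceeds 2**32" if total > MASK32 else "fits in 32 bits"
    wrapped = percentages(counters, wrap32=True)
    summary = "  ".join(f"{k}={v:.2f}%" for k, v in wrapped.items())
    print(f"box {number}: total={total} ({status})  wrapped: {summary}")
```

With the values above, the totals for boxes 2 and 4 exceed 2**32 while
those for boxes 1 and 3 do not, and the wrapped percentages for box 4 come
out in the same range as the sar figures shown earlier.  That is at least
consistent with a 32-bit sum overflowing, though it does not prove it.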




