Hello:

This is a long post, and I apologize for that, but I wanted to include all of
the data related to the problem. Thanks in advance for any help. I've run into
something I've never seen in all my years of working with Red Hat Linux.

We have 4 VA Linux 1220 servers, dual PIII 800 MHz CPUs, 1 GB RAM, 1 GB swap.
All are running VA Linux's enhanced RH 6.2.1 OS, kernel 2.2.18pre11-va2.1smp
#2 SMP Thu May 10 13:31:39 PDT 2001 i686 unknown. The kernel has been
recompiled once to increase NR_TASKS from 512 to 2560 and MAX_TASKS_PER_USER
to 2048 in /usr/src/linux/include/linux/tasks.h, but everything else was left
the same. All 4 run pretty much the same applications; there are slight
differences, but nothing that would cause this problem.

Something very strange happens when I run `procinfo` or `sar` on boxes 2 & 4:
they give ridiculous results, while boxes 1 & 3 give normal-looking results.
In procinfo, 2 & 4 both show 0% CPU for user, nice, system, and idle. Example
from box #2:

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       1048132      879876      168256           0      148252       53452
Swap:      1052248       94336      957912

Bootup: Fri Jul 20 19:31:18 2001    Load average: 0.24 0.18 0.17 1/528 18091

user  :       9d  0:31:58.65   0.0%  page in :  4199302  disk 1: 725731r 38379141w
nice  :      10d 17:54:27.50   0.0%  page out: 44259101
system:      10d  9:32:42.64   0.0%  swap in :   167943
idle  :     479d  9:42:48.65   0.0%  swap out:    87945
uptime:     254d 18:50:58.71         context : 3591061245

Note the 0.0% for all of the CPU usage figures. Box #4 shows similar output.
Here is procinfo from box #1 (box #3 looks similar):

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       1048132      890748      157384           0      177436       80900
Swap:      1052248           0     1052248

Bootup: Thu Mar  7 04:26:28 2002    Load average: 0.25 0.28 0.27 6/455 24648

user  :          18:56:12.69   1.5%  page in :  3198992  disk 1: 943746r 8830308w
nice  :       2d 14:09:03.87   5.1%  page out: 10873687
system:       1d 16:36:38.39   3.3%  swap in :        1
idle  :      45d 16:16:29.43  90.0%  swap out:        0
uptime:      25d  8:59:12.18         context : 1899728374

These CPU percentages look normal.

--------------------------------------------------------

`sar -u` and `sar -U 0|1` show similar weirdness: 1 & 3 look normal, while
2 & 4 give ridiculous numbers.

sar -u on box 1 looks OK:

12:28:11     %user   %nice   %system    %idle
12:28:11     0.71%   0.16%     5.58%   93.55%

sar -u on box 2 makes no sense:

12:27:59     %user   %nice   %system     %idle
12:27:59    45.11%  50.96%   126.21%  3887.72%

Once again, box 3's output looks like box 1's, and box 4's looks like box 2's.
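
In case it helps anyone reproduce the comparison without sar or procinfo,
here is roughly how I understand the since-boot percentages get derived from
the first line of /proc/stat. This is only my own approximation for
cross-checking, not sysstat's or procinfo's actual code, and it assumes the
2.2-style "cpu user nice system idle" line with values in jiffies:

  # Recompute since-boot CPU percentages straight from the cpu line of
  # /proc/stat: each field divided by the sum of all four fields.
  awk 'NR == 1 { u = $2; n = $3; s = $4; i = $5; t = u + n + s + i;
    printf "user %.2f%%  nice %.2f%%  system %.2f%%  idle %.2f%%\n", 100*u/t, 100*n/t, 100*s/t, 100*i/t }' /proc/stat

If that prints sane numbers on boxes 2 & 4 while sar and procinfo do not,
then presumably the raw counters are fine and it is the reporting tools that
are choking on them.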

----------------------------------------------------------

Next I did a `strace -v sar -u > file 2>&1` on boxes 4 & 3. I'll paste the
last few lines of each below; compared line for line, the straces look
almost identical.

box 4:

open("/proc/stat", O_RDONLY)            = 3
fstat64(0x3, 0xbfffddfc)                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
read(3, "cpu 48325711 54591314 135186636"..., 1024) = 904
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x40016000, 4096)                = 0
time([1017692879])                      = 1017692879
write(1, "Linux 2.2.18pre11-va2.1smp (hg-p"..., 170Linux 2.2.18pre11-va2.1smp (hg-prd-04.homegain.com)   04/01/02

12:27:59     %user   %nice   %system     %idle
12:27:59    45.11%  50.96%   126.21%  3887.72%

box 3:

open("/proc/stat", O_RDONLY)            = 3
fstat64(0x3, 0xbfffddfc)                = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
read(3, "cpu 27090298 6436488 212908388 "..., 1024) = 894
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x40016000, 4096)                = 0
time([1017692891])                      = 1017692891
write(1, "Linux 2.2.18pre11-va2.1smp (hg-p"..., 169Linux 2.2.18pre11-va2.1smp (hg-prd-03.homegain.com)   04/01/02

12:28:11     %user   %nice   %system    %idle
12:28:11     0.71%   0.16%     5.58%   93.55%

Both of these straces start at the open of /proc/stat, and they look almost
the same. The one line 'read(3, "cpu N1 N2 N3...' yields 904 on box 4, while
on box 3 it yields 894. That, and the time() values differ slightly (the two
straces were captured about 12 seconds apart). Yet look at the sar results at
the very bottom: box 3 shows normal numbers, box 4 shows crazy ones.

A similar strace comparison between box 1 and box 2 shows identical results
except that the same read line on box 2 returns 904, which matches box 4,
these being the two problem boxes. On box 1 it is 871, 23 less than on box 3:

box 2:  read(3, "cpu 77951409 92818862 89811526 "..., 1024) = 904
box 1:  read(3, "cpu 6816459 22322346 14591024 3"..., 1024) = 871

As I understand it, the number after the "=" is just how many bytes that
read() of /proc/stat returned, so the only clue I can see in this whole
mystery is that both box 2 and box 4 come up with the same count, 904 bytes,
for that one read, and both box 2 and box 4 show crazy results for `sar -u`.

Does anyone know what this means, how it came to be, or what I should do to
get correct stats on the problem boxes 2 & 4? Any help is greatly appreciated.
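
By the way, if anyone wants to poke at this without wading through strace
output, the same cpu line and the same byte count can be pulled directly.
This is just a quick sketch; the pipe through cat is so that wc counts the
bytes actually read rather than trusting the zero size that /proc files
report:

  # First line of /proc/stat, then the size of the whole file in bytes;
  # the latter should match the "= 904" / "= 894" figures from the strace.
  head -1 /proc/stat
  cat /proc/stat | wc -c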

Finally, here is the `cat /proc/stat` output from all 4 boxes at the time the
other data above were gathered:

Box #1

cpu 6815648 22265293 14560643 393652004
cpu0 2892415 10857651 7116311 197780417
cpu1 3923233 11407642 7444332 195871587
disk 9751244 0 0 0
disk_rio 943738 0 0 0
disk_wio 8807506 0 0 0
disk_rblk 6028904 0 0 0
disk_wblk 45698060 0 0 0
disk_pgin 17422726 0 0 0
disk_pgout 52094458 0 0 0
page 3198941 10845620
swap 1 0
intr 908904501 218646794 2 0 3 391 0 0 79345779 1 0 0 0 0 1 30 0 0 109510890 0 0 491982848 0 0 0 9417732 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 1884284700
btime 1015503988
processes 9259259

Box #2

cpu 77951038 92785486 89782075 4141054443
cpu0 38036170 42881571 43823112 2076045668
cpu1 39914868 49903915 45958963 2065008775
disk 39082420 0 0 0
disk_rio 725685 0 0 0
disk_wio 38356735 0 0 0
disk_rblk 4901232 0 0 0
disk_wblk 174330620 0 0 0
disk_pgin 15807422 0 0 0
disk_pgout 195690674 0 0 0
page 4199023 44233024
swap 167691 87945
intr 3045560553 2200786521 5190 0 3 917763 0 0 650561234 1 0 0 0 0 1 30 0 0 1106845836 0 0 3343233806 0 0 0 38177434 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 3583751914
btime 995682677
processes 95573857

Box #3

cpu 27085616 6434384 212834029 3565588659
cpu0 12879380 3221849 106202624 1783667491
cpu1 14206236 3212535 106631405 1781921168
disk 0 0 0
disk_rio 543345 0 0 0
disk_wio 163810810 0 0 0
disk_rblk 3572604 0 0 0
disk_wblk 353498906 0 0 0
disk_pgin 12345368 0 0 0
disk_pgout 437197320 0 0 0
page 3515662 202296516
swap 1 0
intr 1034590436 1905971344 3 0 3 44068 0 0 764316976 1 0 0 0 0 1 30 0 0 957178180 0 0 1539652702 0 0 0 162394394 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 2248435364
btime 998631743
processes 66291265

Box #4

cpu 48325607 54578767 135160103 4163716107
cpu0 24034142 26921114 67556904 2082378132
cpu1 24291465 27657653 67603200 2081337974
disk 54155788 0 0 0
disk_rio 741140 0 0 0
disk_wio 53414648 0 0 0
disk_rblk 5054056 0 0 0
disk_wblk 224983504 0 0 0
disk_pgin 18623390 0 0 0
disk_pgout 250387210 0 0 0
page 4575235 60555155
swap 417212 267467
intr 2838460878 2200890292 3 0 3 968999 0 0 855746431 1 0 0 0 0 1 30 0 0 1086958781 0 0 2935979554 0 0 0 52884049 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 1874175302
btime 995682511
processes 102791618
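
For cross-checking the raw numbers above against procinfo, the four cpu
fields can be converted from jiffies to days. This is a rough sketch; it
assumes HZ=100, which I believe is the default clock tick on these i386 2.2
kernels, so each CPU accumulates 100 jiffies per second:

  # Convert the cpu counters (user, nice, system, idle) from jiffies to
  # days, for comparison with procinfo's day figures.
  awk 'NR == 1 { split("user nice system idle", name, " ");
    for (f = 2; f <= 5; f++) printf "%-7s %12s jiffies = %7.1f days\n", name[f-1], $f, $f / 100 / 86400 }' /proc/stat

On box #2, for example, the idle field above (4141054443 jiffies) works out
to roughly 479 days, which is in the same ballpark as the "idle: 479d" line
that procinfo printed.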

-----------------------------------------------------------------------------
Mike Lee
Unix Systems Admin
[EMAIL PROTECTED]
Homegain.com
=============================================================================

_______________________________________________
Redhat-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/redhat-list