----- Original Message -----
From: "Jonathan Johnson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 19, 2002 8:49 PM
Subject: Diagnosing an elusive fault on a critical system [long]


>
> The REAL problem is that this machine has been crashing periodically.
> It does not always crash in the same way.  It does consistently crash
> on Saturday mornings, toward the end of a lengthy Amanda amdump run.
>
> The system was up and running since the installation in early May.  A
> 2.4.9-31mppe kernel has been in use since the third week of May.

Use 2.4.18. It is rock stable, and MPPE patches are available for it, so you
can patch the kernel source and compile it yourself.

> Amanda backups of local drives began at the end of May, with the
> addition of NT server shares in early June.  There was a lengthy power
> outage June 14th - 15th, but this system was powered down before the
> UPS gave out.  The RH 6.0 network server and firewall have more
> recently been added as Amanda client systems.
>
> Since the first two anomalies were under heavy load and completely
> different, I guessed there was a heat issue (see system specs below for
> the logic of this).  There was a silent, hard crash the first time
> (June 29, a little after 1:30 am), and hard drive errors the second
> time (July 20).
>
> Logs from hard drive errors:
>
>   Jul 20 06:17:51 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>   Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=30879791, sector=548864
>   Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda), sector 548864
>   Jul 20 06:17:51 pegasus kernel: raid5: Disk failure on hda7, disabling device. Operation continuing on 3 devices
>
> After I removed and added /dev/hda7, I ran a CVS update of /etc (like
> the author of the recent Linux Journal article, I keep my life in a CVS
> archive).  More disk errors:
>
>   Jul 20 14:15:05 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>   Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=343674, sector=22272
>   Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda), sector 22272
>   Jul 20 14:15:05 pegasus kernel: raid5: Disk failure on hda5, disabling device. Operation continuing on 3 devices
>
> I removed and added /dev/hda5 and all was well.
>
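
Whether these errors cluster on one drive or spread across several is a
useful clue when choosing between a disk fault and a system-wide cause such
as RAM or cache. A short script can tally them. This is only a sketch in
Python: the sample lines are the July 20 entries quoted above with the mail
wrapping undone, and the regular expression and the function name
`summarize` are my own assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Sample lines copied from the July 20 log excerpts quoted above,
# with the mail client's line wraps joined back together.
LOG = """\
Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=30879791, sector=548864
Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda), sector 548864
Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=343674, sector=22272
Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda), sector 22272
"""

# Matches only the "dma_intr: error=" lines, which carry the LBA sector.
DMA_ERR = re.compile(
    r"kernel: (hd[a-z]): dma_intr: error=(0x[0-9a-f]+) \{ (\w+) \}, LBAsect=(\d+)"
)

def summarize(log_text):
    """Return (error count per drive, list of (drive, error name, LBA sector))."""
    per_drive = Counter()
    events = []
    for line in log_text.splitlines():
        m = DMA_ERR.search(line)
        if m:
            drive, _code, name, lba = m.groups()
            per_drive[drive] += 1
            events.append((drive, name, int(lba)))
    return per_drive, events

per_drive, events = summarize(LOG)
for drive, name, lba in events:
    print(f"{drive}: {name} at LBA sector {lba}")
print(dict(per_drive))
```

To run it against the real log, read the syslog file into a string and pass
that to `summarize` instead of `LOG`.
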
>
> Guess what -- it crashed again last Saturday morning.  More "kernel:
> Oops" messages.  I guess it probably isn't a heat dissipation
> problem...  <:-(
>
> I won't include all the "kernel: Oops" dumps, but here are the initial
> ones from the August 10 and 17 crashes:
>
> Before I drone on with more data, some thoughts I have had:
>
>   - Could the power supply be inadequate?

NO

>
>   - Does the custom kernel have a problem (there _are_ newer kernels
>     out there, but I've avoided building my own up to this point and we
>     need the MPPE patches)?

Maybe. Use 2.4.18 and be very careful which CPU architecture you compile it
for.

Also, swap out the RAM. A kernel "oops" involving page swapping while
untarring seems to come from one of:

1. CPU
2. RAM (try to test the RAM)
3. The motherboard's cache. Turn it off if it is on.
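
For a first, crude look at item 2 without rebooting, something like the
following could be run. It is a sketch only, written in Python, and the
function name `pattern_check` and its defaults are my own choices. It can
only touch the memory the OS happens to hand the process, so a clean run
proves little; a dedicated tester such as memtest86, booted on its own, is
far more thorough.

```python
import os

PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits

def pattern_check(size_mb=16, passes=2):
    """Fill a buffer with fixed and random patterns, read it back,
    and return the number of bytes that did not match."""
    size = size_mb * 1024 * 1024
    mismatches = 0
    for _ in range(passes):
        for value in PATTERNS:
            buf = bytearray([value]) * size
            # Every byte should still hold the value just written.
            mismatches += size - buf.count(value)
        # Random data: copy it, then compare the copy against the source.
        data = os.urandom(size)
        copy = bytearray(data)
        if copy != data:
            mismatches += sum(a != b for a, b in zip(copy, data))
    return mismatches

if __name__ == "__main__":
    print("mismatched bytes:", pattern_check())
```

Any nonzero result on a machine that is otherwise idle would be a strong
hint to replace the RAM or disable the cache before looking further.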

>   - What's the problem with Amanda runs?  Sure the CPU, disk and
>     network are busy, and there's lots of activity on the SCSI tape,
>     but that's life, buddy!

High load. In the past I had buggy drivers for an Intel I2O RAID controller
(the last kernel was 2.4.2).
They worked perfectly, except when a very big download (over 200 MB) began.
Then the cache went to hell and the SCSI disks began to fail with strange
messages such as bad_seek.
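
A large transfer like that can be imitated on purpose instead of waiting for
the next Saturday backup. The sketch below, in Python, writes random data to
a file, reads it back several times, and compares SHA-256 checksums. The
function name `write_and_verify` and its parameters are hypothetical, chosen
for this example. Note that re-reads may be served from the page cache
unless the file is clearly larger than RAM.

```python
import hashlib
import os
import tempfile

def write_and_verify(directory, size_mb=256, rounds=3, chunk_mb=1):
    """Write random data to a temporary file in `directory`, re-read it
    `rounds` times, and return how many re-reads had a different SHA-256."""
    chunk = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                block = os.urandom(chunk)
                written.update(block)
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the disk
        bad = 0
        for _ in range(rounds):
            readback = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    readback.update(block)
            if readback.digest() != written.digest():
                bad += 1
        return bad
    finally:
        os.remove(path)

if __name__ == "__main__":
    # Small size for a quick trial; for a real run raise size_mb well
    # above the installed RAM (128 MB on the machine described here).
    print("bad rounds:", write_and_verify(tempfile.gettempdir(), size_mb=8))
```

A mismatch, or an oops while this runs, reproduces the fault on demand and
makes it much easier to test one change (kernel, RAM, cache setting) at a
time.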

Again:

Most likely: the system cache (hardware), or some kernel bug specific to
your system.
Less likely: RAM.

>
> HARDWARE:
>
>   Motherboard:          Tyan Trinity K7 (S2380)
>   CPU:                  AMD Athlon Slot A 750 MHz
>   Case/PS:              InWin ATX Full Tower Case Q500 w/300w PS and
>                         added front intake fan
>   Memory:               128 Mb
>   Storage:              Promise (PDC20267) PCI IDE controller
>                         Tekram SCSI controller (sym53c8xx: 53c875
>                           detected with Tekram NVRAM)
>                         4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg])
>                         Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
>                         Sony SDX-300C AIT SCSI tape
>                         Exabyte EXB-8200 (tried, unsuccessfully, to
>                           reuse 8mm dump tapes from the Sun server)
>   Networking:           SMC1211TX EZCard 10/100 (RealTek RTL8139)
>
> SOFTWARE:
>
> This is a Red Hat 7.2 system, with all RPMS directly from install or
> Red Hat updates, with the exception of MPPE RPMS from
> ftp://ftp.planetmirror.com/pub/mppe:
>
>   kernel-2.4.9-31mppe.i386.rpm
>   kernel-doc-2.4.9-31mppe.i386.rpm
>   kernel-headers-2.4.9-31mppe.i386.rpm
>   kernel-source-2.4.9-31mppe.i386.rpm
>   ppp-2.4.1-3mppe.i386.rpm
>   pptpd-1.1.3-1.i386.rpm
>
>   Kernel:               Linux version 2.4.9-31mppe (root@richard) (gcc
>                         version 2.96 20000731 (Red Hat Linux 7.1
>                         2.96-98)) #1 Tue Mar 5 18:47:37 CET 2002
>
> MISC. KERNEL INFO:
>
>   $ cat /proc/interrupts
>              CPU0
>     0:    1471213          XT-PIC  timer
>     1:          6          XT-PIC  keyboard
>     2:          0          XT-PIC  cascade
>     4:         16          XT-PIC  serial
>     8:          1          XT-PIC  rtc
>    10:   13562417          XT-PIC  ide2, ide3
>    11:         30          XT-PIC  sym53c8xx
>    12:      34156          XT-PIC  eth0
>    14:    7376543          XT-PIC  ide0
>    15:    7382515          XT-PIC  ide1
>   NMI:          0
>   ERR:          0
>   $ cat /proc/iomem
>   00000000-0009ffff : System RAM
>   000a0000-000bffff : Video RAM area
>   000c0000-000c7fff : Video ROM
>   000c8000-000c9fff : Extension ROM
>   000ca000-000ca1ff : Extension ROM
>   000f0000-000fffff : System ROM
>   00100000-07feffff : System RAM
>     00100000-002b270a : Kernel code
>     002b270b-002c92eb : Kernel data
>   07ff0000-07ff2fff : ACPI Non-volatile Storage
>   07ff3000-07ffffff : ACPI Tables
>   d0000000-d3ffffff : VIA Technologies, Inc. VT8371 [KX133]
>   d4000000-d7ffffff : PCI Bus #01
>     d4000000-d4ffffff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
>     d6000000-d6000fff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
>   d9000000-d901ffff : Promise Technology, Inc. 20267
>   d9020000-d90200ff : Accton Technology Corporation SMC2-1211TX
>     d9020000-d90200ff : 8139too
>   d9021000-d90210ff : Symbios Logic Inc. (formerly NCR) 53c875
>   d9022000-d9022fff : Symbios Logic Inc. (formerly NCR) 53c875
>   ffff0000-ffffffff : reserved
>   $ cat /proc/ioports
>   0000-001f : dma1
>   0020-003f : pic1
>   0040-005f : timer
>   0060-006f : keyboard
>   0070-007f : rtc
>   0080-008f : dma page reg
>   00a0-00bf : pic2
>   00c0-00df : dma2
>   00f0-00ff : fpu
>   0170-0177 : ide1
>   01f0-01f7 : ide0
>   02f8-02ff : serial(auto)
>   0376-0376 : ide1
>   03c0-03df : vga+
>   03f6-03f6 : ide0
>   03f8-03ff : serial(auto)
>   0cf8-0cff : PCI conf1
>   4000-40ff : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
>   5000-500f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
>     5000-5007 : via2-smbus
>   6000-607f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
>     6000-607f : via686a-sensors
>   9000-9fff : PCI Bus #01
>     9000-90ff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X
>   a000-a00f : VIA Technologies, Inc. Bus Master IDE
>     a000-a007 : ide0
>     a008-a00f : ide1
>   ac00-ac07 : Promise Technology, Inc. 20267
>     ac00-ac07 : ide2
>   b000-b003 : Promise Technology, Inc. 20267
>     b002-b002 : ide2
>   b400-b407 : Promise Technology, Inc. 20267
>     b400-b407 : ide3
>   b800-b803 : Promise Technology, Inc. 20267
>     b802-b802 : ide3
>   bc00-bc3f : Promise Technology, Inc. 20267
>     bc00-bc07 : ide2
>     bc08-bc0f : ide3
>     bc10-bc3f : PDC20267
>   c000-c0ff : Accton Technology Corporation SMC2-1211TX
>     c000-c0ff : 8139too
>   c400-c4ff : Symbios Logic Inc. (formerly NCR) 53c875
>     c400-c47f : sym53c8xx
>   $ cat /proc/modules
>   ppp_deflate            39008   0 (autoclean)
>   ppp_mppe               23232   2 (autoclean)
>   bsd_comp                4128   0 (autoclean)
>   ppp_async               6720   1 (autoclean)
>   ppp_generic            19240   3 (autoclean) [ppp_deflate ppp_mppe bsd_comp ppp_async]
>   slhc                    4896   1 (autoclean) [ppp_generic]
>   via686a                 8548   0 (unused)
>   eeprom                  3040   0 (unused)
>   i2c-proc                6368   0 [via686a eeprom]
>   i2c-isa                 1156   0 (unused)
>   i2c-viapro              3848   0 (unused)
>   i2c-core               12864   0 [via686a eeprom i2c-proc i2c-isa i2c-viapro]
>   binfmt_misc             5924   1
>   nfsd                   68512   8 (autoclean)
>   lockd                  50720   1 (autoclean) [nfsd]
>   sunrpc                 61520   1 (autoclean) [nfsd lockd]
>   autofs                 10564   0 (autoclean) (unused)
>   8139too                12672   1
>   ipchains               34568   1
>   st                     25844   0 (unused)
>   ext3                   58912   4
>   jbd                    38500   4 [ext3]
>   raid5                  16864   3
>   xor                     5912   0 [raid5]
>   raid1                  12324   1
>   sym53c8xx              55300   0 (unused)
>   sd_mod                 11836   0 (unused)
>   scsi_mod               92824   3 [st sym53c8xx sd_mod]
>

So, my advice: get the 2.4.18 kernel (forget the Microsoft PPTP patches for
now) and compile it.
The most important point is to make a list of all your hardware plus the
software requirements (for example, software RAID and ext3 journalling) and
to compile in exactly what you need. Nothing more.
Then put the system under heavy load again and watch.
Also check your BIOS options.
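
The /proc/modules listing you posted is a reasonable starting point for that
list. As a sketch only, the Python below splits such a listing into modules
that are in use (nonzero use count, or named as a dependency by another
module) and modules that are idle. The sample is a subset of the lines
quoted above; the function name `modules_in_use` and the regular expression
are assumptions based on that 2.4-style output.

```python
import re

# A subset of the /proc/modules lines quoted earlier in this thread.
SAMPLE = """\
ext3                   58912   4
jbd                    38500   4 [ext3]
raid5                  16864   3
xor                     5912   0 [raid5]
raid1                  12324   1
sym53c8xx              55300   0 (unused)
sd_mod                 11836   0 (unused)
scsi_mod               92824   3 [st sym53c8xx sd_mod]
autofs                 10564   0 (autoclean) (unused)
"""

# name, size in bytes, use count, then optional flags and [dependents]
LINE = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)(.*)$")

def modules_in_use(text):
    """Return (in_use, idle) lists of module names, in listing order."""
    in_use, idle = [], []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        name, _size, count, rest = m.groups()
        users = re.search(r"\[([^\]]*)\]", rest)
        has_users = bool(users and users.group(1).split())
        if int(count) > 0 or has_users:
            in_use.append(name)
        else:
            idle.append(name)
    return in_use, idle

in_use, idle = modules_in_use(SAMPLE)
print("in use:", ", ".join(in_use))
print("idle:  ", ", ".join(idle))
```

Idle does not mean unneeded: sym53c8xx shows as unused between backups but
still drives the SCSI tape, so check the idle list against the hardware list
before dropping anything from the kernel configuration.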

Btw, the Tyan Trinity motherboards had some known problems with AGP (or
ACPI, I do not remember which).
There was an article about it.

Hope that helps...

A very nasty problem. Poor you :(
I would not wish this on any sysadmin.

-- 
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe
https://listman.redhat.com/mailman/listinfo/redhat-list
