Re: Diagnosing an elusive fault on a critical system [long]

Hal Kellerman Mon, 19 Aug 2002 13:23:30 -0700

Have you considered your basic hardware configuration? With all that
high-speed I/O you need a lot of buffer memory. Is your 128 MB of RAM
anywhere near sufficient to handle this job?


Note: the first error in your trace points to a memory problem, possibly
a shortage of it.

Streaming I/O such as for tape drives only increases the need for real
memory.

I have found that Linux systems get a little weird when they start running
out of memory, and the symptoms often seem to point elsewhere.
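
One way to test the memory theory is to sample /proc/meminfo while amdump
is running. The following is only a rough sketch, assuming Python is
available on the server and that /proc/meminfo contains the usual
"Key: value kB" lines (MemTotal, MemFree, Buffers, Cached, SwapTotal,
SwapFree). The function names parse_meminfo and memory_pressure are made up
for illustration, and lines in any other layout are simply skipped.

```python
def parse_meminfo(text):
    """Parse 'Key:   12345 kB' style lines into a dict of integers."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines whose first field after the colon is a number.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Return (fraction of RAM free or reclaimable, fraction of swap in use).

    Buffers and Cached are counted as reclaimable, which is an
    approximation (an assumption of this sketch, not a kernel guarantee).
    """
    total = info.get("MemTotal", 0)
    available = (info.get("MemFree", 0)
                 + info.get("Buffers", 0)
                 + info.get("Cached", 0))
    free_frac = available / total if total else 0.0

    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    swap_frac = swap_used / swap_total if swap_total else 0.0
    return free_frac, swap_frac


if __name__ == "__main__":
    # Reads the live values; only meaningful on a Linux system.
    with open("/proc/meminfo") as f:
        free_frac, swap_frac = memory_pressure(parse_meminfo(f.read()))
    print("RAM free or reclaimable: %.1f%%" % (100 * free_frac))
    print("Swap in use:             %.1f%%" % (100 * swap_frac))
```

Run every few minutes during the backup window (from cron, say), a free
fraction near zero together with swap filling up would support the
low-memory explanation; steady numbers right up to the crash would argue
against it.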

Just a brief thought from my experience.

Hal K.

[EMAIL PROTECTED]

Please visit the following website: http://www.rv-portal.com



----- Original Message -----
From: "Jonathan Johnson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 19, 2002 10:49 AM
Subject: Diagnosing an elusive fault on a critical system [long]


> Dear Red Hatters,
>
> Sorry to join the list like this, but...
>
> I am in a bad spot.  Our company has taken the step to replace an ancient
> Sun Sparc II and a recently-compromised RH 6.0 network server with a
> new RH 7.2 omni-server with software RAID, backup tape, VPN, updated
> network services and increased security.  So far so good.
>
> Unfortunately, it's taken longer than expected to configure and migrate
> all the services, so costs are running up and management is less than
> enthusiastic about things at this point.  But this isn't the real
> problem.
>
> The REAL problem is that this machine has been crashing periodically.
> It does not always crash in the same way.  It does consistently crash
> on Saturday mornings, toward the end of a lengthy Amanda amdump run.
>
> The system was up and running since the installation in early May.  A
> 2.4.9-31mppe kernel has been in use since the third week of May.
> Amanda backups of local drives began at the end of May, with the
> addition of NT server shares in early June.  There was a lengthy power
> outage June 14th - 15th, but this system was powered down before the
> UPS gave out.  The RH 6.0 network server and firewall have more
> recently been added as Amanda client systems.
>
> Since the first two anomalies were under heavy load and completely
> different, I guessed there was a heat issue (see system specs below for
> the logic of this).  There was a silent, hard crash the first time
> (June 29, a little after 1:30 am), and hard drive errors the second
> time (July 20).
>
> Logs from hard drive errors:
>
>   Jul 20 06:17:51 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>   Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=30879791, sector=548864
>   Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda), sector 548864
>   Jul 20 06:17:51 pegasus kernel: raid5: Disk failure on hda7, disabling device. Operation continuing on 3 devices
>
> After I removed and added /dev/hda7, I ran a CVS update of /etc (like
> the author of the recent Linux Journal article, I keep my life in a CVS
> archive).  More disk errors:
>
>   Jul 20 14:15:05 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
>   Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=343674, sector=22272
>   Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda), sector 22272
>   Jul 20 14:15:05 pegasus kernel: raid5: Disk failure on hda5, disabling device. Operation continuing on 3 devices
>
> I removed and added /dev/hda5 and all was well.
>
> These drive errors were completely transient; I had no more disk errors
> afterward although we continued to run in this state through the end of
> July, when I rebooted after updating the openssl RPMs.  Weird, isn't
> it?  Surely something was overheating, right?  We changed the office
> thermostat to leave the fans running 24/7, though the air conditioners
> are still at 78F except between 6am and 10pm weekdays, when it cools
> down to 74F.
>
> After a third crash under the same circumstances (Aug 10), involving a
> long run of "kernel: Oops" messages this time, I ordered additional
> fans and pulled the cover off the case to let it breathe freely until I
> could take it down and install the fans.
>
> Guess what -- it crashed again last Saturday morning.  More "kernel:
> Oops" messages.  I guess it probably isn't a heat dissipation
> problem...  <:-(
>
> I won't include all the "kernel: Oops" dumps, but here are the initial
> ones from the August 10 and 17 crashes:
>
>   Aug 10 05:04:19 pegasus sendbackup[9944]: error [/bin/tar got signal 11, index got signal 11, compress got signal 11]
>   Aug 10 05:04:19 pegasus kernel: Unable to handle kernel paging request at virtual address 56aabf94
>-- CLIPPED --
> HARDWARE:
>
>   Motherboard:          Tyan Trinity K7 (S2380)
>   CPU:                  AMD Athlon Slot A 750 MHz
>   Case/PS:              InWin ATX Full Tower Case Q500 w/300w PS and
>                         added front intake fan
>   Memory:               128 MB
>   Storage:              Promise (PDC20267) PCI IDE controller
>                         Tekram SCSI controller (sym53c8xx: 53c875
>                           detected with Tekram NVRAM)
>                         4 IBM-DTLA-307030 (30 GB) drives (hd[aceg])
>                         Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
>                         Sony SDX-300C AIT SCSI tape
>                         Exabyte EXB-8200 (tried, unsuccessfully, to
>                           reuse 8mm dump tapes from the Sun server)
>   Networking:           SMC1211TX EZCard 10/100 (RealTek RTL8139)
--CLIPPED--




-- 
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe
https://listman.redhat.com/mailman/listinfo/redhat-list