Have you considered your basic hardware configuration? With all that high-speed I/O you need a lot of buffer memory. Is your 128 MB of RAM anywhere near sufficient to handle this job?
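If you want to put a number on it, something like the following could be run while the backup is going. It is only a rough sketch, assuming a Linux /proc/meminfo that contains `Name: value kB` lines with the usual MemTotal, MemFree, Buffers and Cached fields; the function names are my own, and any line in a different format is simply skipped.

```python
import re

# Matches lines such as "MemFree:    4000 kB". Lines in any other format
# (for example a byte-count summary header) do not match and are ignored.
_KB_LINE = re.compile(r"^(\w+):\s+(\d+)\s+kB\s*$")


def parse_meminfo(text):
    """Return {field: kilobytes} for every 'Name: N kB' line in text."""
    info = {}
    for line in text.splitlines():
        match = _KB_LINE.match(line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def available_fraction(info):
    """Rough share of RAM that is free or reclaimable (buffers + page cache)."""
    reclaimable = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return reclaimable / info["MemTotal"]


def report(path="/proc/meminfo"):
    """Read the given meminfo file and return a short summary string."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    return "MemTotal %d kB, free+buffers+cached %.0f%%, SwapFree %d kB" % (
        info["MemTotal"],
        100 * available_fraction(info),
        info.get("SwapFree", 0),
    )
```

Calling `print(report())` every minute or so during an amdump run would show whether free memory and swap are shrinking toward zero before the crash. It will not prove anything by itself, but it is a cheap thing to rule out.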
Note: the first error in your trace points to a memory problem (or rather a lack of memory). Streaming I/O, such as for tape drives, only increases the need for real memory. I have found that Linux systems get a little weird when they start running out of memory, and the symptoms seem to point elsewhere. Just a brief thought from my experience.

Hal K.
[EMAIL PROTECTED]
Please visit the following website: http://www.rv-portal.com

----- Original Message -----
From: "Jonathan Johnson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 19, 2002 10:49 AM
Subject: Diagnosing an elusive fault on a critical system [long]

> Dear Red Hatters,
>
> Sorry to join the list like this, but...
>
> I am in a bad spot. Our company has taken the step to replace an ancient
> Sun Sparc II and a recently-compromised RH 6.0 network server with a
> new RH 7.2 omni-server with software RAID, backup tape, VPN, updated
> network services and increased security. So far so good.
>
> Unfortunately, it's taken longer than expected to configure and migrate
> all the services, so the costs are running up and the management is
> less than enthusiastic with things at this point. But this isn't the
> real problem.
>
> The REAL problem is that this machine has been crashing periodically.
> It does not always crash in the same way. It does consistently crash
> on Saturday mornings, toward the end of a lengthy Amanda amdump run.
>
> The system was up and running since the installation in early May. A
> 2.4.9-31mppe kernel has been in use since the third week of May.
> Amanda backups of local drives began at the end of May, with the
> addition of NT server shares in early June. There was a lengthy power
> outage June 14th - 15th, but this system was powered down before the
> UPS gave out. The RH 6.0 network server and firewall have more
> recently been added as Amanda client systems.
>
> Since the first two anomalies were under heavy load and completely
> different, I guessed there was a heat issue (see system specs below for
> the logic of this). There was a silent, hard crash the first time
> (June 29, a little after 1:30 am), and hard drive errors the second
> time (July 20).
>
> Logs from hard drive errors:
>
> Jul 20 06:17:51 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=30879791, sector=548864
> Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda), sector 548864
> Jul 20 06:17:51 pegasus kernel: raid5: Disk failure on hda7, disabling device. Operation continuing on 3 devices
>
> After I removed and added /dev/hda7, I ran a CVS update of /etc (like
> the author of the recent Linux Journal article, I keep my life in a CVS
> archive). More disk errors:
>
> Jul 20 14:15:05 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=343674, sector=22272
> Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda), sector 22272
> Jul 20 14:15:05 pegasus kernel: raid5: Disk failure on hda5, disabling device. Operation continuing on 3 devices
>
> I removed and added /dev/hda5 and all was well.
>
> These drive errors were completely transient; I had no more disk errors
> afterward although we continued to run in this state through the end of
> July, when I rebooted after updating the openssl RPMs. Weird, isn't
> it? Surely something was overheating, right? We changed the office
> thermostat to leave the fans running 24/7, though the air conditioners
> are still at 78F except between 6am and 10pm weekdays, when it cools
> down to 74F.
>
> After a third crash under the same circumstances (Aug 10), involving a
> long run of "kernel: Oops" messages this time, I ordered additional
> fans and pulled the cover off the case to let it breathe freely until I
> could take it down and install the fans.
>
> Guess what -- it crashed again last Saturday morning. More "kernel:
> Oops" messages. I guess it probably isn't a heat dissipation
> problem... <:-(
>
> I won't include all the "kernel: Oops" dumps, but here are the initial
> ones from the August 10 and 17 crashes:
>
> Aug 10 05:04:19 pegasus sendbackup[9944]: error [/bin/tar got signal 11, index got signal 11, compress got signal 11]
> Aug 10 05:04:19 pegasus kernel: Unable to handle kernel paging request at virtual address 56aabf94
>-- CLIPPED --
> HARDWARE:
>
> Motherboard: Tyan Trinity K7 (S2380)
> CPU:         AMD Athlon Slot A 750 MHz
> Case/PS:     InWin ATX Full Tower Case Q500 w/300w PS and
>              added front intake fan
> Memory:      128 Mb
> Storage:     Promise (PDC20267) PCI IDE controller
>              Tekram SCSI controller (sym53c8xx: 53c875
>              detected with Tekram NVRAM)
>              4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg])
>              Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb)
>              Sony SDX-300C AIT SCSI tape
>              Exabyte EXB-8200 (tried, unsuccessfully, to
>              reuse 8mm dump tapes from the Sun server)
> Networking:  SMC1211TX EZCard 10/100 (RealTek RTL8139)
--CLIPPED--

--
redhat-list mailing list
unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe
https://listman.redhat.com/mailman/listinfo/redhat-list