Most likely ram or kernel, use a stock kernel first, change the ram if that doesn't work, if it doesn't solve it you can eliminate the most likely candidates.
-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jonathan Johnson Sent: Monday, August 19, 2002 1:49 PM To: [EMAIL PROTECTED] Subject: Diagnosing an elusive fault on a critical system [long] Dear Red Hatters, Sorry to join the list like this, but... I am a bad spot. Our company has taken the step to replace an ancient Sun Sparc II and a recently-compromised RH 6.0 network server with a new RH 7.2 omni-server with software RAID, backup tape, VPN, updated network services and increased security. So far so good. Unfortunately, it's taken longer than expected to configure and migrate all the services, so the costs are running up and the management is less than enthusiastic with things at this point. But this isn't the real problem. The REAL problem is that this machine has been crashing periodically. It does not always crash in the same way. It does consistently crash on Saturday mornings, toward the end of a lengthy Amanda amdump run. The system was up and running since the installation in early May. A 2.4.9-31mppe kernel has been in use since the third week of May. Amanda backups of local drives began at the end of May, with the addition of NT server shares in early June. There was a lengthy power outage June 14th - 15th, but this system was powered down before the UPS gave out. The RH 6.0 network server and firewall have more recently been added as Amanda client systems. Since the first two anomalies were under heavy load and completely different, I guessed there was a heat issue (see system specs below for the logic of this). There was a silent, hard crash the first time (June 29, a little after 1:30 am), and hard drive errors the second time (July 20). Logs from hard drive errors: Jul 20 06:17:51 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } Jul 20 06:17:51 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=30879791, sector=548864 Jul 20 06:17:51 pegasus kernel: end_request: I/O error, dev 03:07 (hda), sector 548864 Jul 20 06:17:51 pegasus kernel: raid5: Disk failure on hda7, disabling device. Operation continuing on 3 devices After I removed and added /dev/hda7, I ran a CVS update of /etc (like the author of the recent Linux Journal article, I keep my life in a CVS archive). More disk errors: Jul 20 14:15:05 pegasus kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } Jul 20 14:15:05 pegasus kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=343674, sector=22272 Jul 20 14:15:05 pegasus kernel: end_request: I/O error, dev 03:05 (hda), sector 22272 Jul 20 14:15:05 pegasus kernel: raid5: Disk failure on hda5, disabling device. Operation continuing on 3 devices I removed and added /hda5 and all was well. These drive errors were completely transient; I had no more disk errors afterward although we continued to run in this state through the end of July, when I rebooted after updating the openssl RPMs. Weird, isn't it? Surely something was overheating, right? We changed the office thermostat to leave the fans running 24/7, though the air conditioners are still at 78F except between 6am and 10pm weekdays, when it cools down to 74F. After a third crash under the same circumstances (Aug 10), involving a long run of "kernel: Oops" messages this time, I ordered additional fans and pulled the cover off the case to let it breathe freely until I could take it down and install the fans. Guess what -- it crashed again last Saturday morning. More "kernel: Oops" messages. I guess it probably isn't a heat dissipation problem... <:-( I won't include all the "kernel: Oops" dumps, but here are the initial ones from the August 10 and 17 crashes: Aug 10 05:04:19 pegasus sendbackup[9944]: error [/bin/tar got signal 11, index got signal 11, compress got signal 11] Aug 10 05:04:19 pegasus kernel: Unable to handle kernel paging request at virtual address 56aabf94 Aug 10 05:04:19 pegasus kernel: printing eip: Aug 10 05:04:19 pegasus kernel: c0129885 Aug 10 05:04:19 pegasus kernel: *pde = 00000000 Aug 10 05:04:19 pegasus kernel: Oops: 0000 Aug 10 05:04:19 pegasus kernel: Kernel 2.4.9-31mppe Aug 10 05:04:19 pegasus kernel: CPU: 0 Aug 10 05:04:19 pegasus kernel: EIP: 0010:[kmalloc+105/256] Tainted: P Aug 10 05:04:19 pegasus kernel: EIP: 0010:[<c0129885>] Tainted: P Aug 10 05:04:19 pegasus kernel: EFLAGS: 00010887 Aug 10 05:04:19 pegasus kernel: EIP is at kmalloc [kernel] 0x69 Aug 10 05:04:19 pegasus kernel: eax: a53507df ebx: c14c00b0 ecx: c1d6a000 edx: 00000001 Aug 10 05:04:19 pegasus kernel: esi: 00000246 edi: c14c00b0 ebp: a6a0fbe0 esp: c4bafe60 Aug 10 05:04:19 pegasus kernel: ds: 0018 es: 0018 ss: 0018 Aug 10 05:04:19 pegasus kernel: Process tar (pid: 9948, stackpage=c4baf000) Aug 10 05:04:19 pegasus kernel: Stack: 000de860 00000082 c033ca00 c033c9c0 c4bae000 00000001 00000070 00000014 Aug 10 05:04:19 pegasus kernel: c8842f1b 00000014 00000070 00000000 c4bae000 c56f5760 c6c0f200 c883d334 Aug 10 05:04:19 pegasus kernel: c88457d0 00000014 00000070 00000001 00000000 00000000 ffffffe2 c56f5760 Aug 10 05:04:19 pegasus kernel: Call Trace: [8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-47 9461/96] journal_blocks_per_page_R338b3e60 [jbd] 0x6b Aug 10 05:04:19 pegasus kernel: Call Trace: [<c8842f1b>] journal_blocks_per_page_R338b3e60 [jbd] 0x6b Aug 10 05:04:19 pegasus kernel: [8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-50 2988/96] journal_start_R819d0567 [jbd] 0x80 Aug 10 05:04:19 pegasus kernel: [<c883d334>] journal_start_R819d0567 [jbd] 0x80 Aug 10 05:04:19 pegasus kernel: [8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-46 9040/96] __insmod_jbd_S.rodata_L96 [jbd] 0x2150 Aug 10 05:04:19 pegasus kernel: [<c88457d0>] __insmod_jbd_S.rodata_L96 [jbd] 0x2150 Aug 10 05:04:19 pegasus kernel: [8139too:__insmod_8139too_O/lib/modules/2.4.9-31mppe/kernel/drivers/+-43 5991/96] __insmod_ext3_S.text_L40936 [ext3] 0x5889 Aug 10 05:04:19 pegasus kernel: [<c884d8e9>] __insmod_ext3_S.text_L40936 [ext3] 0x5889 Aug 10 05:04:19 pegasus kernel: [___wait_on_page+141/148] ___wait_on_page [kernel] 0x8d Aug 10 05:04:19 pegasus kernel: [<c0124f11>] ___wait_on_page [kernel] 0x8d Aug 10 05:04:19 pegasus kernel: [__mark_inode_dirty+42/128] __mark_inode_dirty [kernel] 0x2a Aug 10 05:04:19 pegasus kernel: [<c01435e6>] __mark_inode_dirty [kernel] 0x2a Aug 10 05:04:19 pegasus kernel: [update_atime+74/80] update_atime [kernel] 0x4a Aug 10 05:04:19 pegasus kernel: [<c0144c5e>] update_atime [kernel] 0x4a Aug 10 05:04:19 pegasus kernel: [do_generic_file_read+1258/1268] do_generic_file_read [kernel] 0x4ea Aug 10 05:04:19 pegasus kernel: [<c0125826>] do_generic_file_read [kernel] 0x4ea Aug 10 05:04:19 pegasus kernel: [generic_file_read+124/148] generic_file_read [kernel] 0x7c Aug 10 05:04:19 pegasus kernel: [<c0125900>] generic_file_read [kernel] 0x7c Aug 10 05:04:19 pegasus kernel: [file_read_actor+0/84] file_read_actor [kernel] 0x0 Aug 10 05:04:19 pegasus kernel: [<c0125830>] file_read_actor [kernel] 0x0 Aug 10 05:04:19 pegasus kernel: [sys_read+149/244] sys_read [kernel] 0x95 Aug 10 05:04:19 pegasus kernel: [<c0131de1>] sys_read [kernel] 0x95 Aug 10 05:04:19 pegasus kernel: [sys_open+121/164] sys_open [kernel] 0x79 Aug 10 05:04:19 pegasus kernel: [<c01318f9>] sys_open [kernel] 0x79 Aug 10 05:04:19 pegasus kernel: [system_call+51/56] system_call [kernel] 0x33 Aug 10 05:04:19 pegasus kernel: [<c0106e0f>] system_call [kernel] 0x33 Aug 10 05:04:19 pegasus kernel: Aug 10 05:04:19 pegasus kernel: Aug 10 05:04:19 pegasus kernel: Code: 8b 44 81 18 03 69 0c 89 41 14 40 75 26 8b 41 04 8b 11 89 42 Aug 17 06:55:50 pegasus kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000 Aug 17 06:55:50 pegasus kernel: printing eip: Aug 17 06:55:50 pegasus kernel: c012aa6b Aug 17 06:55:50 pegasus kernel: *pde = 00000000 Aug 17 06:55:50 pegasus kernel: Oops: 0002 Aug 17 06:55:50 pegasus kernel: Kernel 2.4.9-31mppe Aug 17 06:55:50 pegasus kernel: CPU: 0 Aug 17 06:55:50 pegasus kernel: EIP: 0010:[reclaim_page+147/936] Tainted: P Aug 17 06:55:50 pegasus kernel: EIP: 0010:[<c012aa6b>] Tainted: P Aug 17 06:55:50 pegasus kernel: EFLAGS: 00010282 Aug 17 06:55:50 pegasus kernel: EIP is at reclaim_page [kernel] 0x93 Aug 17 06:55:50 pegasus kernel: eax: c02b5c9c ebx: c10ae9c0 ecx: c10ae9c0 edx: 00000000 Aug 17 06:55:50 pegasus kernel: esi: c10ae9a4 edi: 000000cd ebp: c02b5c70 esp: c3761ebc Aug 17 06:55:50 pegasus kernel: ds: 0018 es: 0018 ss: 0018 Aug 17 06:55:50 pegasus kernel: Process tar (pid: 590, stackpage=c3761000) Aug 17 06:55:50 pegasus sendbackup[586]: error [/bin/tar got signal 11] Aug 17 06:55:50 pegasus kernel: Stack: c02b5c70 c02b5e34 00000001 00000001 c012c529 c02b5c70 00000000 c02b5e3c Aug 17 06:55:50 pegasus kernel: 00000000 000000d2 c012c629 c02b5e30 00000000 00000001 00000001 00000001 Aug 17 06:55:50 pegasus kernel: c02b5e30 c7f2b804 00000000 c555d158 00000000 c0125712 0000019d 00000001 Aug 17 06:55:50 pegasus kernel: Call Trace: [__alloc_pages_limit+113/152] __alloc_pages_limit [kernel] 0x71 Aug 17 06:55:50 pegasus kernel: Call Trace: [<c012c529>] __alloc_pages_limit [kernel] 0x71 Aug 17 06:55:51 pegasus kernel: [__alloc_pages+197/600] __alloc_pages [kernel] 0xc5 Aug 17 06:55:51 pegasus kernel: [<c012c629>] __alloc_pages [kernel] 0xc5 Aug 17 06:55:51 pegasus kernel: [do_generic_file_read+982/1268] do_generic_file_read [kernel] 0x3d6 Aug 17 06:55:51 pegasus kernel: [<c0125712>] do_generic_file_read [kernel] 0x3d6 Aug 17 06:55:51 pegasus kernel: [generic_file_read+124/148] generic_file_read [kernel] 0x7c Aug 17 06:55:51 pegasus kernel: [<c0125900>] generic_file_read [kernel] 0x7c Aug 17 06:55:51 pegasus kernel: [file_read_actor+0/84] file_read_actor [kernel] 0x0 Aug 17 06:55:51 pegasus kernel: [<c0125830>] file_read_actor [kernel] 0x0 Aug 17 06:55:51 pegasus kernel: [sys_read+149/244] sys_read [kernel] 0x95 Aug 17 06:55:51 pegasus kernel: [<c0131de1>] sys_read [kernel] 0x95 Aug 17 06:55:51 pegasus kernel: [sys_open+121/164] sys_open [kernel] 0x79 Aug 17 06:55:51 pegasus kernel: [<c01318f9>] sys_open [kernel] 0x79 Aug 17 06:55:51 pegasus kernel: [system_call+51/56] system_call [kernel] 0x33 Aug 17 06:55:51 pegasus kernel: [<c0106e0f>] system_call [kernel] 0x33 Aug 17 06:55:51 pegasus kernel: Aug 17 06:55:51 pegasus kernel: Aug 17 06:55:51 pegasus kernel: Code: 89 02 c7 41 04 00 00 00 00 b8 0b 00 00 00 c7 46 1c 00 00 00 What then? A precise diagnosis of the problem is critical. A solution is overdue. We can't afford to kill this project, it needs to be finished, but this system just isn't reliable enough to be the cornerstone of our corporate network. We've had almost 100% reliability from our old RH6.0 network server and its firewalling companion. (Good thing, too, or we'd probably not touch Linux again for quite a while based on our current situation alone.) I've run the lm_sensors setup and tried making sensors measurements every 10 minutes (like /usr/lib/sa/sa1), but after a day or so the temperatures stay fixed and I start getting syslog messages about bus collisions from i2c-viapro.o... Thus, no sensor data from actual crashes. Here's some basic information, and I'd love to know what additional information would be most effective in pinpointing the big, bad bug in this valuable system. Before I drone on with more data, some thoughts I have had: - Could the power supply be inadequate? - Does the custom kernel have a problem (there _are_ newer kernels out there, but I've avoided building my own up to this point and we need the MPPE patches)? - What's the problem with Amanda runs? Sure the CPU, disk and network are busy, and there's lots of activity on the SCSI tape, but that's life, buddy! HARDWARE: Motherboard: Tyan Trinity K7 (S2380) CPU: AMD Athlon Slot A 750 MHz Case/PS: InWin ATX Full Tower Case Q500 w/300w PS and added front intake fan Memory: 128 Mb Storage: Promise (PDC20267) PCI IDE controller Tekram SCSI controller (sym53c8xx: 53c875 detected with Tekram NVRAM) 4 IBM-DTLA-307030 (30 Gb) drives (hd[aceg]) Pioneer DVD-ROM ATAPIModel DVD-106S 012 (hdb) Sony SDX-300C AIT SCSI tape Exabyte EXB-8200 (tried, unsuccessfully, to reuse 8mm dump tapes from the Sun server) Networking: SMC1211TX EZCard 10/100 (RealTek RTL8139) SOFTWARE: This is a Red Hat 7.2 system, with all RPMS directly from install or Red Hat updates, with the exception of MPPE RPMS from ftp://ftp.planetmirror.com/pub/mppe: kernel-2.4.9-31mppe.i386.rpm kernel-doc-2.4.9-31mppe.i386.rpm kernel-headers-2.4.9-31mppe.i386.rpm kernel-source-2.4.9-31mppe.i386.rpm ppp-2.4.1-3mppe.i386.rpm pptpd-1.1.3-1.i386.rpm Kernel: Linux version 2.4.9-31mppe (root@richard) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 Tue Mar 5 18:47:37 CET 2002 Filesystems: $ mount /dev/md1 on / type ext3 (rw) none on /proc type proc (rw) /dev/md2 on /boot type ext3 (rw) none on /dev/pts type devpts (rw,gid=5,mode=620) /dev/md3 on /hold type ext3 (rw) /dev/md0 on /home type ext3 (rw) none on /dev/shm type tmpfs (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) $ cat /proc/mdstat Personalities : [raid1] [raid5] read_ahead 1024 sectors md3 : active raid5 hda7[0] hdc7[1] hde7[2] hdg7[3] 44555904 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU] md0 : active raid5 hda6[0] hdc6[1] hde6[2] hdg6[3] 42604032 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU] md1 : active raid5 hda5[0] hdc5[1] hde5[2] hdg5[3] 2409408 blocks level 5, 64k chunk, algorithm 0 [4/4] [UUUU] md2 : active raid1 hda1[0] hdc1[1] 56128 blocks [2/2] [UU] unused devices: <none> $ cat /proc/swaps Filename Type Size Used Priority /dev/hde1 partition 160608 16852 -1 /dev/hdg1 partition 160608 0 -2 /dev/hda2 partition 104412 0 -3 /dev/hdc2 partition 104412 0 -4 SERVICES: SysVInit at runlevel 5: anacron apmd atd autofs crond gpm ipchains iptables isdn keytable kudzu lpd netfs network nfs nfslock ntpd p4d portmap pptpd random rawdevices sendmail smb sshd syslog wine xfs xinetd Via xinetd: amanda amanda amandaidx amidxtape imap ipop3 sgi_fam talk telnet wu-ftpd (I don't know why chkconfig shows amanda twice...) MISC. KERNEL INFO: $ cat /proc/interrupts CPU0 0: 1471213 XT-PIC timer 1: 6 XT-PIC keyboard 2: 0 XT-PIC cascade 4: 16 XT-PIC serial 8: 1 XT-PIC rtc 10: 13562417 XT-PIC ide2, ide3 11: 30 XT-PIC sym53c8xx 12: 34156 XT-PIC eth0 14: 7376543 XT-PIC ide0 15: 7382515 XT-PIC ide1 NMI: 0 ERR: 0 $ cat /proc/iomem 00000000-0009ffff : System RAM 000a0000-000bffff : Video RAM area 000c0000-000c7fff : Video ROM 000c8000-000c9fff : Extension ROM 000ca000-000ca1ff : Extension ROM 000f0000-000fffff : System ROM 00100000-07feffff : System RAM 00100000-002b270a : Kernel code 002b270b-002c92eb : Kernel data 07ff0000-07ff2fff : ACPI Non-volatile Storage 07ff3000-07ffffff : ACPI Tables d0000000-d3ffffff : VIA Technologies, Inc. VT8371 [KX133] d4000000-d7ffffff : PCI Bus #01 d4000000-d4ffffff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X d6000000-d6000fff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X d9000000-d901ffff : Promise Technology, Inc. 20267 d9020000-d90200ff : Accton Technology Corporation SMC2-1211TX d9020000-d90200ff : 8139too d9021000-d90210ff : Symbios Logic Inc. (formerly NCR) 53c875 d9022000-d9022fff : Symbios Logic Inc. (formerly NCR) 53c875 ffff0000-ffffffff : reserved $ cat /proc/ioports 0000-001f : dma1 0020-003f : pic1 0040-005f : timer 0060-006f : keyboard 0070-007f : rtc 0080-008f : dma page reg 00a0-00bf : pic2 00c0-00df : dma2 00f0-00ff : fpu 0170-0177 : ide1 01f0-01f7 : ide0 02f8-02ff : serial(auto) 0376-0376 : ide1 03c0-03df : vga+ 03f6-03f6 : ide0 03f8-03ff : serial(auto) 0cf8-0cff : PCI conf1 4000-40ff : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] 5000-500f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] 5000-5007 : via2-smbus 6000-607f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] 6000-607f : via686a-sensors 9000-9fff : PCI Bus #01 9000-90ff : ATI Technologies Inc 3D Rage Pro AGP 1X/2X a000-a00f : VIA Technologies, Inc. Bus Master IDE a000-a007 : ide0 a008-a00f : ide1 ac00-ac07 : Promise Technology, Inc. 20267 ac00-ac07 : ide2 b000-b003 : Promise Technology, Inc. 20267 b002-b002 : ide2 b400-b407 : Promise Technology, Inc. 20267 b400-b407 : ide3 b800-b803 : Promise Technology, Inc. 20267 b802-b802 : ide3 bc00-bc3f : Promise Technology, Inc. 20267 bc00-bc07 : ide2 bc08-bc0f : ide3 bc10-bc3f : PDC20267 c000-c0ff : Accton Technology Corporation SMC2-1211TX c000-c0ff : 8139too c400-c4ff : Symbios Logic Inc. (formerly NCR) 53c875 c400-c47f : sym53c8xx $ cat /proc/modules ppp_deflate 39008 0 (autoclean) ppp_mppe 23232 2 (autoclean) bsd_comp 4128 0 (autoclean) ppp_async 6720 1 (autoclean) ppp_generic 19240 3 (autoclean) [ppp_deflate ppp_mppe bsd_comp ppp_async] slhc 4896 1 (autoclean) [ppp_generic] via686a 8548 0 (unused) eeprom 3040 0 (unused) i2c-proc 6368 0 [via686a eeprom] i2c-isa 1156 0 (unused) i2c-viapro 3848 0 (unused) i2c-core 12864 0 [via686a eeprom i2c-proc i2c-isa i2c-viapro] binfmt_misc 5924 1 nfsd 68512 8 (autoclean) lockd 50720 1 (autoclean) [nfsd] sunrpc 61520 1 (autoclean) [nfsd lockd] autofs 10564 0 (autoclean) (unused) 8139too 12672 1 ipchains 34568 1 st 25844 0 (unused) ext3 58912 4 jbd 38500 4 [ext3] raid5 16864 3 xor 5912 0 [raid5] raid1 12324 1 sym53c8xx 55300 0 (unused) sd_mod 11836 0 (unused) scsi_mod 92824 3 [st sym53c8xx sd_mod] Thanks in advance, especially if you actually read this far!! Only a true Linux fan would have stayed awake to this, the 378th line of this message. :) Regards, Jonathan -- / Jonathan R. Johnson | "Every word of God is flawless." \ | Minnetonka Software, Inc. | -- Proverbs 30:5 | \ [EMAIL PROTECTED] | My own words only speak for me. / -- redhat-list mailing list unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe https://listman.redhat.com/mailman/listinfo/redhat-list -- redhat-list mailing list unsubscribe mailto:[EMAIL PROTECTED]?subject=unsubscribe https://listman.redhat.com/mailman/listinfo/redhat-list