Hi folks, I have a problem that's now beyond my ability to diagnose properly. I get random, intermittent kernel errors, usually when the system is under stress.
System specs:

  AMD X4 840 (badged Phenom II, but it's really an Athlon core)
  ASUS M4A88TD-M EVO/USB3
  2x 2GB sticks of Corsair 1600 DDR3
  500GB WD Caviar Blue

Below are some examples of the errors.

square kernel: [ 683.271626] Pid: 6593, comm: rsync Tainted: P D 2.6.32-5-amd64 #1
Apr 24 14:51:38 square kernel: [ 683.271631] Call Trace:
Apr 24 14:51:38 square kernel: [ 683.271648] [<ffffffff810cad37>] ? print_bad_pte+0x232/0x24a
Apr 24 14:51:38 square kernel: [ 683.271660] [<ffffffff810cbde7>] ? unmap_vmas+0x62d/0x931
Apr 24 14:51:38 square kernel: [ 683.271672] [<ffffffff8118e194>] ? cpumask_any_but+0x28/0x34
Apr 24 14:51:38 square kernel: [ 683.271682] [<ffffffff810d04c4>] ? exit_mmap+0xc4/0x148
Apr 24 14:51:38 square kernel: [ 683.271690] [<ffffffff8104bc6d>] ? mmput+0x3c/0xdf
Apr 24 14:51:38 square kernel: [ 683.271698] [<ffffffff8104f866>] ? exit_mm+0x102/0x10d
Apr 24 14:51:38 square kernel: [ 683.271705] [<ffffffff8105128b>] ? do_exit+0x1f8/0x6c6
Apr 24 14:51:38 square kernel: [ 683.271712] [<ffffffff810517cf>] ? do_group_exit+0x76/0x9d
Apr 24 14:51:38 square kernel: [ 683.271720] [<ffffffff81051808>] ? sys_exit_group+0x12/0x16
Apr 24 14:51:38 square kernel: [ 683.271727] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Apr 24 14:51:44 square kerneloops: Submitted 1 kernel oopses to www.kerneloops.org

Another, from Minecraft:

d: 6742, comm: java Tainted: P B D 2.6.32-5-amd64 #1
Apr 24 15:12:02 square kernel: [ 1907.726033] Call Trace:
Apr 24 15:12:02 square kernel: [ 1907.726039] [<ffffffff810b7a11>] ? bad_page+0x116/0x129
Apr 24 15:12:02 square kernel: [ 1907.726042] [<ffffffff810b9b2e>] ? get_page_from_freelist+0x4fd/0x760
Apr 24 15:12:02 square kernel: [ 1907.726098] [<ffffffffa0246f02>] ? firegl_trace+0x72/0x1e0 [fglrx]
Apr 24 15:12:02 square kernel: [ 1907.726100] [<ffffffff810ba0f8>] ? __alloc_pages_nodemask+0x11c/0x5f4
Apr 24 15:12:02 square kernel: [ 1907.726104] [<ffffffff81036605>] ? native_flush_tlb_others+0xb6/0xe3
Apr 24 15:12:02 square kernel: [ 1907.726107] [<ffffffff810bc479>] ? ____pagevec_lru_add+0x160/0x176
Apr 24 15:12:02 square kernel: [ 1907.726110] [<ffffffff810cc981>] ? handle_mm_fault+0x27a/0x80f
Apr 24 15:12:02 square kernel: [ 1907.726113] [<ffffffff812fe6b6>] ? do_page_fault+0x2e0/0x2fc
Apr 24 15:12:02 square kernel: [ 1907.726116] [<ffffffff812fc555>] ? page_fault+0x25/0x30

Another one, from stress:

stress D 0000000000000000 0 5972 5963 0x00000000
Apr 25 21:16:11 square kernel: [ 360.740389] ffff88011b04dbd0 0000000000000082 ffff880114f40150 000000000000000e
Apr 25 21:16:11 square kernel: [ 360.740392] 0007ffffffffffff 0000000000000000 000000000000f9e0 ffff880100329fd8
Apr 25 21:16:11 square kernel: [ 360.740395] 0000000000015780 0000000000015780 ffff88011b04f100 ffff88011b04f3f8
Apr 25 21:16:11 square kernel: [ 360.740397] Call Trace:
Apr 25 21:16:11 square kernel: [ 360.740404] [<ffffffff8104001f>] ? check_preempt_wakeup+0x1dd/0x268
Apr 25 21:16:11 square kernel: [ 360.740408] [<ffffffff812fb65b>] ? __mutex_lock_common+0x122/0x192
Apr 25 21:16:11 square kernel: [ 360.740411] [<ffffffff810493e0>] ? update_rq_clock+0xf/0x28
Apr 25 21:16:11 square kernel: [ 360.740413] [<ffffffff812fb783>] ? mutex_lock+0x1a/0x31
Apr 25 21:16:11 square kernel: [ 360.740416] [<ffffffff8110be35>] ? sync_filesystems+0x13/0xe3
Apr 25 21:16:11 square kernel: [ 360.740418] [<ffffffff8110bf4a>] ? sys_sync+0x1c/0x2e
Apr 25 21:16:11 square kernel: [ 360.740420] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Apr 25 21:18:11 square kernel: [ 480.740375] stress D ffff8800cf609c40 0 5965 5963 0x00000000
Apr 25 21:18:11 square kernel: [ 480.740378] ffff8800cf609c40 0000000000000086 ffffffff810414d5 000000010000000e
Apr 25 21:18:11 square kernel: [ 480.740381] 0000000000015780 ffff880100383e68 000000000000f9e0 ffff880100383fd8
Apr 25 21:18:11 square kernel: [ 480.740383] 0000000000015780 0000000000015780 ffff8800cf60f100 ffff8800cf60f3f8
Apr 25 21:18:11 square kernel: [ 480.740385] Call Trace:
Apr 25 21:18:11 square kernel: [ 480.740392] [<ffffffff810414d5>] ? select_task_rq_fair+0x472/0x836
Apr 25 21:18:11 square kernel: [ 480.740395] [<ffffffff8101650e>] ? native_sched_clock+0x2e/0x66
Apr 25 21:18:11 square kernel: [ 480.740397] [<ffffffff8103fc8e>] ? update_curr+0xa6/0x147
Apr 25 21:18:11 square kernel: [ 480.740399] [<ffffffff8101654b>] ? sched_clock+0x5/0x8
Apr 25 21:18:11 square kernel: [ 480.740402] [<ffffffff812fb65b>] ? __mutex_lock_common+0x122/0x192
Apr 25 21:18:11 square kernel: [ 480.740404] [<ffffffff812fb783>] ? mutex_lock+0x1a/0x31
Apr 25 21:18:11 square kernel: [ 480.740407] [<ffffffff8110be35>] ? sync_filesystems+0x13/0xe3
Apr 25 21:18:11 square kernel: [ 480.740409] [<ffffffff8110bf40>] ? sys_sync+0x12/0x2e
Apr 25 21:18:11 square kernel: [ 480.740411] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

My attempts at troubleshooting so far:

1) Compile kernels and flightgear. Usually fails after 10 mins or so.
2) Remove one mem stick, swap with the other. Try different slots. It fails "less often" with one stick than with both.
3) Memtest86+ shows both sticks to be OK.
4) Ran "stress". This fails more often if I enable the HDD tests, but it still fails without them.
5) Installed Fedora to prove it's not just a Debian thing. The errors are exactly the same under Fedora.

I'm at a loss as to what it could be and would like to determine at least something before I start throwing money around.
All I have left is that some incompatibility between mobo/mem/cpu/disk is causing this. Does anyone have any advice on what tools I can use to narrow it down further or eliminate certain components?
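
In case it helps anyone compare the traces, here is a rough Python sketch that tallies the symbol names appearing in call-trace lines like the ones above, and notes any frame that comes from a loadable module (fglrx shows up in the java trace). The regex and the names parse_frames and summarize are my own assumptions for illustration, not part of any existing tool, and the sample only contains four lines copied from the logs in this message. It would probably need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches one call-trace frame as it appears in the logs above, e.g.
#   [<ffffffff810b7a11>] ? bad_page+0x116/0x129
# optionally followed by a module name in brackets, e.g. [fglrx].
# This pattern is an assumption based only on the lines in this post.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+\?\s+"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return a list of (symbol, module_or_None) for every frame in text."""
    return [(m.group("symbol"), m.group("module")) for m in FRAME_RE.finditer(text)]


def summarize(text):
    """Count how often each symbol and each module appears in the traces."""
    frames = parse_frames(text)
    symbols = Counter(sym for sym, _ in frames)
    modules = Counter(mod for _, mod in frames if mod)
    return symbols, modules


# Four frames copied from the traces in this message.
SAMPLE = """\
Apr 24 15:12:02 square kernel: [ 1907.726039] [<ffffffff810b7a11>] ? bad_page+0x116/0x129
Apr 24 15:12:02 square kernel: [ 1907.726098] [<ffffffffa0246f02>] ? firegl_trace+0x72/0x1e0 [fglrx]
Apr 25 21:16:11 square kernel: [ 360.740413] [<ffffffff812fb783>] ? mutex_lock+0x1a/0x31
Apr 25 21:18:11 square kernel: [ 480.740404] [<ffffffff812fb783>] ? mutex_lock+0x1a/0x31
"""

if __name__ == "__main__":
    symbols, modules = summarize(SAMPLE)
    for name, count in symbols.most_common():
        print(count, name)
    print("modules:", dict(modules))
```

Feeding it a whole saved log instead of SAMPLE (read from a file of your choosing) would show whether the failures cluster in memory-management paths such as bad_page and print_bad_pte, or whether a particular module keeps turning up.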