Analyzing logs... We have TONS of stack traces similar to this:

Nov 27 19:06:49 sgsxeris001 kernel: [522969.113150] general protection fault: 0000 [#474] SMP
Nov 27 19:06:49 sgsxeris001 kernel: [522969.113341] CPU 35
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115290]
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115361] Pid: 63574, comm: make Tainted: G D 3.2.0-67-generic #101-Ubuntu HP ProLiant DL380p Gen8
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115567] RIP: 0010:[<ffffffff8116616e>] [<ffffffff8116616e>] kmem_cache_alloc_trace+0x5e/0x140
...
Nov 27 19:06:49 sgsxeris001 kernel: [522969.116824] Stack:
...

Meaning that ALL processes that were scheduled on CPU 35 and executed either:

RIP = kmem_cache_alloc_trace+0x5e/0x140
OR
RIP = __kmalloc+0x7b/0x190

(RIP = Instruction Pointer)

caused the CPU to raise a general protection fault. Protection faults can lead the system to HANG if double or triple faults happen (the second/third fault happens while the first one is still being handled by the Linux exception handler).

inaddy@workstation:~/.../var/log$ cat syslog | egrep "RIP:" | wc -l
2632

2632 is the number of times a process caused a protection fault, all of them while scheduled on CPU 35.

Following these 2 Instruction Pointers...
(from kmem_cache_alloc_trace AND __kmalloc), both of them are in the same piece of code (and the same instructions):

2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
0xffffffff81166576 <+86>: mov (%r12),%rsi
0xffffffff8116657e <+94>: mov 0x0(%r13,%rax,1),%rbx
0xffffffff81166583 <+99>: mov %r13,%rax
0xffffffff81166586 <+102>: callq 0xffffffff8131cb20
0xffffffff8116658b <+107>: data32 xchg %ax,%ax
0xffffffff8116658e <+110>: test %al,%al
0xffffffff81166590 <+112>: je 0xffffffff81166554 <kmem_cache_alloc_trace+52>

2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
0xffffffff81166113 <+115>: mov (%r12),%rsi
0xffffffff8116611b <+123>: mov 0x0(%r13,%rax,1),%rbx
0xffffffff81166120 <+128>: mov %r13,%rax
0xffffffff81166123 <+131>: callq 0xffffffff8131cb20
0xffffffff81166128 <+136>: data32 xchg %ax,%ax
0xffffffff8116612b <+139>: test %al,%al
0xffffffff8116612d <+141>: je 0xffffffff811660f1 <__kmalloc+81>

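For anyone who wants to repeat this kind of tally on their own syslog, here is a minimal Python sketch. It is not what was run for this bug (the count quoted in this comment came from egrep); it only assumes the log line layout shown in the trace, where a "CPU <n>" line comes before the "RIP:" line of the same oops. The names tally_faults, CPU_RE, RIP_RE and SAMPLE are made up for illustration.

```python
import re
from collections import Counter

# "CPU 35" line of an oops, as it appears in the trace quoted in this comment.
CPU_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] CPU (\d+)\s*$")

# "RIP: 0010:[<addr>] [<addr>] symbol+0xOFFSET/0xSIZE" line of an oops.
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def tally_faults(lines):
    """Count oopses per (cpu, symbol, offset).

    Assumption: each RIP line belongs to the most recent CPU line seen.
    Oopses interleaved from several CPUs would break that pairing; in
    that case the cpu field may be wrong or None.
    """
    counts = Counter()
    cpu = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = RIP_RE.search(line)
        if m:
            offset = int(m.group("off"), 16)  # hex offset into the function
            counts[(cpu, m.group("sym"), offset)] += 1
            cpu = None  # do not reuse a stale CPU for the next oops
    return counts


# Two lines copied from the trace quoted in this comment.
SAMPLE = [
    "Nov 27 19:06:49 sgsxeris001 kernel: [522969.113341] CPU 35 ",
    "Nov 27 19:06:49 sgsxeris001 kernel: [522969.115567] RIP: "
    "0010:[<ffffffff8116616e>] [<ffffffff8116616e>] "
    "kmem_cache_alloc_trace+0x5e/0x140",
]

if __name__ == "__main__":
    # To run against a real file instead: tally_faults(open("syslog"))
    for (cpu, sym, off), n in tally_faults(SAMPLE).most_common():
        print(f"CPU {cpu}: {sym}+{off:#x} = disassembly offset <+{off}>, seen {n}x")
```

The decimal offset is printed because the disassembly listings use decimal: 0x5e is 94 and 0x7b is 123, which are the <+94> and <+123> lines (the same mov instruction) in the two listings.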
https://bugs.launchpad.net/bugs/1398497

Title:
  HP Proliant Servers - DL360 and DL380 Gen8 - Precise Kernel Panic - General Protection Fault

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Precise:
  Incomplete

Bug description:
  The following situation was brought to my attention:

  """
  We massively upgraded our Ubuntu 12.04 servers (most of them are HP DL360p Gen8 or DL380 Gen8) to 3.2.0-67 kernel

  And in the last 2-3 days we already had to reboot 5 of them because they completely hang

  Some of them had the following messages under syslog :

  kernel: [384707.675479] general protection fault: 0000 [#5666] SMP

  others had :

  kernel: [950725.612724] BUG: unable to handle kernel paging request

  All of them have this also :

  your BIOS is broken and requested that x2apic be disabled
  """

  Comments below