[Kernel-packages] [Bug 1398497] Re: HP Proliant Serverrs - DL360 and DL380 Gen8 - Precise Kernel Panic - General Protection Fault

Rafael David Tinoco Tue, 02 Dec 2014 11:04:16 -0800

Analyzing logs...

We have TONS of stack traces similar to this:


Nov 27 19:06:49 sgsxeris001 kernel: [522969.113150] general protection fault: 0000 [#474] SMP
Nov 27 19:06:49 sgsxeris001 kernel: [522969.113341] CPU 35
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115290]
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115361] Pid: 63574, comm: make Tainted: G D 3.2.0-67-generic #101-Ubuntu HP ProLiant DL380p Gen8
Nov 27 19:06:49 sgsxeris001 kernel: [522969.115567] RIP: 0010:[<ffffffff8116616e>] [<ffffffff8116616e>] kmem_cache_alloc_trace+0x5e/0x140
...
Nov 27 19:06:49 sgsxeris001 kernel: [522969.116824] Stack:
...

This means that ALL processes that were scheduled on CPU 35 and executed
either:

RIP = kmem_cache_alloc_trace+0x5e/0x140 OR
RIP = __kmalloc+0x7b/0x190

(RIP = Instruction Pointer)

caused a general protection fault on that CPU. Protection faults can
lead the system to HANG if a double or triple fault happens (the
second/third fault occurs while the first one is still being handled
by the Linux exception handler).

inaddy@workstation:~/.../var/log$ cat syslog | egrep "RIP:" | wc -l
2632

2632 is the number of times a process caused a protection fault, all of
them while scheduled on CPU 35.
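
As a cross-check on that count, here is a minimal sketch that tallies
the "RIP:" lines per CPU and per faulting symbol instead of only
counting them. It assumes each syslog record sits on one line in the
format of the excerpt above (not wrapped as in this mail) and that
oopses from different CPUs do not interleave. The name tally_faults and
both regular expressions are my own, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line shapes, based only on the excerpt quoted in this message.
CPU_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] CPU (\d+)\s*$")
RIP_RE = re.compile(
    r"RIP:\s+\S+\s+\[<[0-9a-f]+>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x[0-9a-f]+"
)


def tally_faults(lines):
    """Count (cpu, symbol, offset) for every 'RIP:' line.

    The CPU is taken from the most recent 'CPU N' line, so interleaved
    oopses from several CPUs would be attributed incorrectly.
    """
    counts = Counter()
    cpu = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = RIP_RE.search(line)
        if m:
            counts[(cpu, m.group(1), int(m.group(2), 16))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 27 19:06:49 sgsxeris001 kernel: [522969.113341] CPU 35",
        "Nov 27 19:06:49 sgsxeris001 kernel: [522969.115567] RIP: "
        "0010:[<ffffffff8116616e>] [<ffffffff8116616e>] "
        "kmem_cache_alloc_trace+0x5e/0x140",
    ]
    print(tally_faults(sample))
```

On the real file it could be called as tally_faults(open("syslog")). If
every key carries CPU 35 and only the two symbols listed above appear,
that agrees with the reading given here.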

Following these two instruction pointers (from kmem_cache_alloc_trace
and __kmalloc), both of them land in the same piece of code, on the
same instructions:

2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
0xffffffff81166576 <+86>: mov (%r12),%rsi
0xffffffff8116657e <+94>: mov 0x0(%r13,%rax,1),%rbx
0xffffffff81166583 <+99>: mov %r13,%rax
0xffffffff81166586 <+102>: callq 0xffffffff8131cb20
0xffffffff8116658b <+107>: data32 xchg %ax,%ax
0xffffffff8116658e <+110>: test %al,%al
0xffffffff81166590 <+112>: je 0xffffffff81166554 <kmem_cache_alloc_trace+52>

2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
0xffffffff81166113 <+115>: mov (%r12),%rsi
0xffffffff8116611b <+123>: mov 0x0(%r13,%rax,1),%rbx
0xffffffff81166120 <+128>: mov %r13,%rax
0xffffffff81166123 <+131>: callq 0xffffffff8131cb20
0xffffffff81166128 <+136>: data32 xchg %ax,%ax
0xffffffff8116612b <+139>: test %al,%al
0xffffffff8116612d <+141>: je 0xffffffff811660f1 <__kmalloc+81>
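
To line the RIP values up with the listings above: the oops prints the
offset into the function in hex (symbol+0xOFF/0xSIZE), while gdb labels
each instruction with a decimal <+N>. A small sketch of the conversion
follows; the helper name split_oops_location is mine, used only for
illustration.

```python
def split_oops_location(location):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size) as integers."""
    symbol, rest = location.split("+", 1)
    offset_hex, size_hex = rest.split("/", 1)
    return symbol, int(offset_hex, 16), int(size_hex, 16)


if __name__ == "__main__":
    for loc in ("kmem_cache_alloc_trace+0x5e/0x140", "__kmalloc+0x7b/0x190"):
        symbol, offset, size = split_oops_location(loc)
        # Print the offset in the decimal form gdb uses for <+N>.
        print(f"{symbol} <+{offset}> (function size {size} bytes)")
```

0x5e is 94 and 0x7b is 123, and in the two listings <+94> and <+123>
are both the "mov 0x0(%r13,%rax,1),%rbx" line.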

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1398497

Title:
  HP Proliant Serverrs - DL360 and DL380 Gen8 - Precise Kernel Panic -
  General Protection Fault

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Precise:
  Incomplete

Bug description:
  The following situation was brought to my attention:

  """
  We massively upgraded our Ubuntu 12.04 servers (most of them are HP
  DL360p Gen8 or DL380 Gen8) to the 3.2.0-67 kernel, and in the last 2-3
  days we have already had to reboot 5 of them because they hung completely

  Some of them had the following messages under syslog :
  kernel: [384707.675479] general protection fault: 0000 [#5666] SMP

  others had :
  kernel: [950725.612724] BUG: unable to handle kernel paging request

  All of them have this also :
  your BIOS is broken and requested that x2apic be disabled
  """

  Comments below

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497/+subscriptions
