I'm attaching the crash tool output from the 3.13 kernel dump.

Most likely related to the situation already found in the following case:
-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540

Handled by Chris Arges and me in LKML discussions with Ingo and Linus:
-> http://www.kernelhub.org/?p=2&msg=683682

FOR NOW, it is LIKELY that I'll rely on the already known recommendations
for ProLiant (including the ones related to X2APIC mode):
-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

So we can TRY TO GUARANTEE that there are no LOST IRQs (IPIs) with the
firmware you're using. Hopefully, with the proper APIC mode set as HP
recommends, we will not have those IPI problems.

OBS: Whenever IPIs are lost (we've seen this on some nested KVMs and
some buggy HW), we can get locked up in the SMP callback state machine.
This means that the state machine loses IPI ACKs and loops forever
trying to shut down the CPU so that the SMP task queue can continue.
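
The lost-IPI failure mode can be sketched in user space as follows. This
is only a minimal model, loosely based on the "csd->flags & CSD_FLAG_LOCK"
wait quoted later in this comment. The struct csd_model, the helper names
and the max_spins bound are assumptions made for illustration and are not
kernel code; the real loop has no bound, which is why it never returns
when the IPI is lost.

```c
#include <assert.h>
#include <stdbool.h>

#define CSD_FLAG_LOCK 0x01u /* the bit tested by "testb $0x1" in the dump */

/* Hypothetical stand-in for the kernel's per-call data; not the real struct. */
struct csd_model {
    unsigned int flags;
};

/* Target side: what the IPI handler would do once the IPI arrives. */
static void ipi_handler_model(struct csd_model *csd)
{
    csd->flags &= ~CSD_FLAG_LOCK;
}

/* Sender side: spin on the lock bit, but give up after max_spins polls. */
static bool csd_wait_bounded(const struct csd_model *csd, unsigned long max_spins)
{
    for (unsigned long spins = 0; spins < max_spins; spins++) {
        if (!(csd->flags & CSD_FLAG_LOCK))
            return true;   /* target acknowledged, sender may continue */
    }
    return false;          /* still locked: the real loop would spin forever */
}

/* Returns true if the sender sees the lock released within max_spins polls. */
bool simulate_call(bool ipi_delivered, unsigned long max_spins)
{
    struct csd_model csd = { .flags = 0 };

    assert(max_spins > 0);
    csd.flags |= CSD_FLAG_LOCK;      /* sender marks the request in flight */
    if (ipi_delivered)
        ipi_handler_model(&csd);     /* a lost IPI means this never runs */
    return csd_wait_bounded(&csd, max_spins);
}
```

In this model simulate_call(true, 1000) completes, while
simulate_call(false, 1000) gives up with the lock bit still set, which
is the situation the sender is stuck in when the IPI never arrives.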

I'll SOON provide a comment with SUGGESTIONS, asking for FEEDBACK.

################

For now, from the 3.13 kernel dump, the most interesting part:

We had 7 CPUs executing the migration kernel thread (for the SMP
callback state machine execution):

#### migration tasks (state machine loop)

>    93      2   4  ffff8808147b47d0  RU   0.0       0      0  [migration/4]
>   118      2   9  ffff881814a2c7d0  RU   0.0       0      0  [migration/9]
>   123      2  10  ffff88081404c7d0  RU   0.0       0      0  [migration/10]
>   128      2  11  ffff881814a4c7d0  RU   0.0       0      0  [migration/11]
>   138      2  13  ffff881814a647d0  RU   0.0       0      0  [migration/13]
>   165      2  18  ffff8810149ec7d0  RU   0.0       0      0  [migration/18]
>   195      2  24  ffff881014a647d0  RU   0.0       0      0  [migration/24]

This logic tries to migrate tasks from one CPU to another. For that to
happen, it relies on the state machine logic that shuts CPUs down
(turning off IRQs, etc.) before migrating the tasks. The state machine,
which shuts the CPUs down in phases, relies on the SMP callbacks below.
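
To illustrate why the migration threads keep looping, here is a minimal
sketch of a phased rendezvous, assuming that every CPU must acknowledge
the current phase before any CPU may move to the next one. The phase
names, struct stop_model and run_rounds() are hypothetical and only model
the idea; they are not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 64

/* Illustrative phases of a stop-machine style rendezvous. */
enum stop_state { STOP_NONE, STOP_PREPARE, STOP_DISABLE_IRQ, STOP_RUN, STOP_EXIT };

struct stop_model {
    enum stop_state state;  /* phase all CPUs are currently working on */
    size_t ncpus;
    size_t acks_pending;    /* CPUs that have not yet acked 'state' */
};

static void set_state(struct stop_model *m, enum stop_state next)
{
    m->acks_pending = m->ncpus;
    m->state = next;
}

/* One CPU acks the current phase; the last CPU to ack advances the phase. */
static void ack_state(struct stop_model *m)
{
    assert(m->acks_pending > 0);
    if (--m->acks_pending == 0 && m->state != STOP_EXIT)
        set_state(m, (enum stop_state)(m->state + 1));
}

/*
 * Poll all CPUs for 'rounds' rounds. A CPU with responsive[cpu] == false
 * never acks (it is stuck elsewhere). Returns the phase reached.
 */
enum stop_state run_rounds(size_t ncpus, const bool *responsive, size_t rounds)
{
    struct stop_model m = { .state = STOP_NONE, .ncpus = ncpus, .acks_pending = 0 };
    enum stop_state acked[MAX_CPUS];

    assert(ncpus > 0 && ncpus <= MAX_CPUS);
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        acked[cpu] = STOP_NONE;

    set_state(&m, STOP_PREPARE);
    for (size_t r = 0; r < rounds && m.state != STOP_EXIT; r++) {
        for (size_t cpu = 0; cpu < ncpus; cpu++) {
            if (!responsive[cpu] || acked[cpu] == m.state)
                continue;            /* stuck CPU, or already acked this phase */
            acked[cpu] = m.state;
            ack_state(&m);
        }
    }
    return m.state;
}
```

With all CPUs responsive the model reaches STOP_EXIT after a few rounds.
With a single CPU that never acks, it stays in STOP_PREPARE no matter how
many rounds are run, which is how the remaining CPUs end up looping.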

We had 3 CPUs in a part of the kernel that we have already identified to
be problematic under certain conditions and/or HW.

** > 17247      1  23  ffff881007055fc0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 17247  TASK: ffff881007055fc0  CPU: 23  COMMAND: "qemu-system-x86"
 #0 [ffff88203eac6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eac6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eac6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eac6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8810ea7c96e0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8810ea7c96e0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8810ea7c96e0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bc87bc6  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8810ea7c96e0] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }

** > 21036      1  27  ffff8810b69947d0  RU   1.0 7484828 1401940  qemu-system-x86

PID: 21036  TASK: ffff8810b69947d0  CPU: 27  COMMAND: "qemu-system-x86"
 #0 [ffff88203eb46e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eb46e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eb46ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eb46ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8814959f7670  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8814959f7670  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8814959f7670  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000282  R14: 0000000000000000  R15: 0000000000000100
    ORIG_RAX: 0000000000000100  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8814959f7670] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }

** > 18516      1  31  ffff881dd54a2fe0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 18516  TASK: ffff881dd54a2fe0  CPU: 31  COMMAND: "qemu-system-x86"
 #0 [ffff88203ebc6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203ebc6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203ebc6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203ebc6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff881dd55597a0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff881dd55597a0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff881dd55597a0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bca7bc8  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881dd55597a0] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }

** Attachment removed: "lp1505564-3.13-kdump-crash-output.txt"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/+attachment/4509470/+files/lp1505564-3.13-kdump-crash-output.txt

https://bugs.launchpad.net/bugs/1505564

Title:
  Soft lockup with "block nbdX: Attempted send on closed socket" spam
