[Kernel-packages] [Bug 1676678] Re: ISST-LTE:dotg6:Kernel access of bad area, sig: 11 - during stress tests

Frank Heimes Fri, 15 Sep 2017 11:12:50 -0700

** Changed in: linux (Ubuntu)
       Status: Incomplete => Invalid

** Changed in: ubuntu-power-systems
       Status: Incomplete => Invalid


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1676678

Title:
  ISST-LTE:dotg6:Kernel access of bad area, sig: 11 - during stress
  tests

Status in The Ubuntu-power-systems project:
  Invalid
Status in linux package in Ubuntu:
  Invalid

Bug description:
  ---Problem Description---
  After running stress tests (IO, TCP, BASE) for a few hours, Ubuntu 17.04 KVM guest dotg6 crashed, produced a kdump, and rebooted.
   
  ---uname output---
  Linux dotg6 4.10.0-13-generic #15-Ubuntu SMP Thu Mar 9 20:27:28 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = KVM guest on an 8247-22L (host also running Ubuntu 17.04)
   
  Stack trace output:
  [ 1909.621800] Oops: Kernel access of bad area, sig: 11 [#1]
  [ 1909.621870] SMP NR_CPUS=2048
  [ 1909.621871] NUMA
  [ 1909.621925] pSeries
  [ 1909.622016] Modules linked in: minix nls_iso8859_1 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc xfs libcrc32c vmx_crypto sunrpc ip_tables x_tables autofs4 btrfs xor raid6_pq dm_service_time crc32c_vpmsum virtio_scsi virtio_net scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
  [ 1909.622401] CPU: 2 PID: 27704 Comm: ppc64_cpu Not tainted 4.10.0-13-generic #15-Ubuntu
  [ 1909.622536] task: c000000042a64200 task.stack: c00000003423c000
  [ 1909.622627] NIP: d0000000016a14f4 LR: d0000000016a14a0 CTR: c000000000609d00
  [ 1909.622737] REGS: c00000003423f7f0 TRAP: 0380   Not tainted  (4.10.0-13-generic)
  [ 1909.622850] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
  [ 1909.622860]   CR: 24002428  XER: 20000000
  [ 1909.623016] CFAR: c00000000061a238 SOFTE: 1
  [ 1909.623016] GPR00: d0000000016a14a0 c00000003423fa70 d0000000016ab8cc c000000170fd5000
  [ 1909.623016] GPR04: ffffffffffffffff 0000000000000000 0000000000000000 0000000000007530
  [ 1909.623016] GPR08: c00000000146c700 757465736d642f6e c00000000146dbe0 d0000000016a2ef8
  [ 1909.623016] GPR12: c000000000609d00 c000000001b81200 0000000000000008 0000000000000001
  [ 1909.623016] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000046f05f80
  [ 1909.623016] GPR20: 0000000046f061f8 0000000000000000 0000000046f05f58 c00000017fd4a808
  [ 1909.623016] GPR24: 0000000000000001 c000000170fc7a30 c000000001326eb0 d0000000016a1648
  [ 1909.623016] GPR28: c000000001471c28 c000000170fc7860 0000000071ae4a20 0000000000000058
  [ 1909.623990] NIP [d0000000016a14f4] __virtscsi_set_affinity+0xac/0x200 [virtio_scsi]
  [ 1909.624114] LR [d0000000016a14a0] __virtscsi_set_affinity+0x58/0x200 [virtio_scsi]
  [ 1909.624235] Call Trace:
  [ 1909.624278] [c00000003423fa70] [d0000000016a14a0] __virtscsi_set_affinity+0x58/0x200 [virtio_scsi] (unreliable)
  [ 1909.624445] [c00000003423fac0] [d0000000016a1678] virtscsi_cpu_online+0x30/0x70 [virtio_scsi]
  [ 1909.624746] [c00000003423fae0] [c0000000000db73c] cpuhp_invoke_callback+0x3ec/0x5a0
  [ 1909.624887] [c00000003423fb50] [c0000000000dba88] cpuhp_down_callbacks+0x78/0xf0
  [ 1909.625037] [c00000003423fba0] [c000000000268bb0] _cpu_down+0x150/0x1b0
  [ 1909.625174] [c00000003423fc00] [c0000000000de1b4] do_cpu_down+0x64/0xb0
  [ 1909.625330] [c00000003423fc40] [c00000000074b834] cpu_subsys_offline+0x24/0x40
  [ 1909.625485] [c00000003423fc60] [c000000000743284] device_offline+0xf4/0x130
  [ 1909.625610] [c00000003423fca0] [c000000000743434] online_store+0x64/0xb0
  [ 1909.625736] [c00000003423fce0] [c00000000073e37c] dev_attr_store+0x3c/0x60
  [ 1909.625862] [c00000003423fd00] [c0000000003faa18] sysfs_kf_write+0x68/0xa0
  [ 1909.625984] [c00000003423fd20] [c0000000003f98bc] kernfs_fop_write+0x17c/0x250
  [ 1909.626132] [c00000003423fd70] [c00000000033c98c] __vfs_write+0x3c/0x70
  [ 1909.626253] [c00000003423fd90] [c00000000033e414] vfs_write+0xd4/0x240
  [ 1909.626374] [c00000003423fde0] [c00000000033ffc8] SyS_write+0x68/0x110
  [ 1909.626501] [c00000003423fe30] [c00000000000b184] system_call+0x38/0xe0
  [ 1909.626624] Instruction dump:
  [ 1909.626691] 2f890000 419e0064 3be00000 393f0021 3880ffff 792926e4 7d3d4a14 e9290010
  [ 1909.626835] 2fa90000 7d234b78 419e002c e9290020 <e9290330> e9290058 2fa90000 7d2c4b78
  [ 1909.627003] ---[ end trace ecc8a323beb021a2 ]---

   
  crash>  bt
  PID: 27704  TASK: c000000042a64200  CPU: 2   COMMAND: "ppc64_cpu"
   #0 [c00000003423f630] crash_kexec at c0000000001a04c4
   #1 [c00000003423f670] oops_end at c000000000024da8
   #2 [c00000003423f6f0] bad_page_fault at c0000000000627b0
   #3 [c00000003423f760] slb_miss_bad_addr at c000000000026828
   #4 [c00000003423f780] bad_addr_slb at c000000000008acc
   Data SLB Access [380] exception frame:
   R0:  d0000000016a14a0    R1:  c00000003423fa70    R2:  d0000000016ab8cc   
   R3:  c000000170fd5000    R4:  ffffffffffffffff    R5:  0000000000000000   
   R6:  0000000000000000    R7:  0000000000007530    R8:  c00000000146c700   
   R9:  757465736d642f6e    R10: c00000000146dbe0    R11: d0000000016a2ef8   
   R12: c000000000609d00    R13: c000000001b81200    R14: 0000000000000008   
   R15: 0000000000000001    R16: 0000000000000000    R17: 0000000000000000   
   R18: 0000000000000000    R19: 0000000046f05f80    R20: 0000000046f061f8   
   R21: 0000000000000000    R22: 0000000046f05f58    R23: c00000017fd4a808   
   R24: 0000000000000001    R25: c000000170fc7a30    R26: c000000001326eb0   
   R27: d0000000016a1648    R28: c000000001471c28    R29: c000000170fc7860   
   R30: 0000000071ae4a20    R31: 0000000000000058   
   NIP: d0000000016a14f4    MSR: 800000000280b033    OR3: c00000000061a238
   CTR: c000000000609d00    LR:  d0000000016a14a0    XER: 0000000020000000
   CCR: 0000000024002428    MQ:  0000000000000001    DAR: 757465736d64329e
   DSISR: c00000000001b910     Syscall Result: 0000000000000000
   #5 [c00000003423fa70] __virtscsi_set_affinity at d0000000016a14f4 [virtio_scsi]
   [Link Register] [c00000003423fa70] __virtscsi_set_affinity at d0000000016a14a0  (unreliable)
   #6 [c00000003423fac0] virtscsi_cpu_online at d0000000016a1678 [virtio_scsi]
   #7 [c00000003423fae0] cpuhp_invoke_callback at c0000000000db73c
   #8 [c00000003423fb50] cpuhp_down_callbacks at c0000000000dba88
   #9 [c00000003423fba0] _cpu_down at c000000000268bb0
  #10 [c00000003423fc00] do_cpu_down at c0000000000de1b4
  #11 [c00000003423fc40] cpu_subsys_offline at c00000000074b834
  #12 [c00000003423fc60] device_offline at c000000000743284
  #13 [c00000003423fca0] online_store at c000000000743434
  #14 [c00000003423fce0] dev_attr_store at c00000000073e37c
  #15 [c00000003423fd00] sysfs_kf_write at c0000000003faa18
  #16 [c00000003423fd20] kernfs_fop_write at c0000000003f98bc
  #17 [c00000003423fd70] __vfs_write at c00000000033c98c
  #18 [c00000003423fd90] vfs_write at c00000000033e414
  #19 [c00000003423fde0] sys_write at c00000000033ffc8
  #20 [c00000003423fe30] system_call at c00000000000b184
   System Call [c01] exception frame:
   R0:  0000000000000004    R1:  00003ffff823a5c0    R2:  00003fff7bf57f00   
   R3:  0000000000000008    R4:  0000010029020080    R5:  0000000000000001   
   R6:  00003fff7bee0d2c    R7:  0000010029020010    R8:  0000000000000000   
   R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000   
   R12: 0000000000000000    R13: 00003fff7bfed060   
   NIP: 00003fff7bf350cc    MSR: 800000000280f033    OR3: 0000000000000008
   CTR: 0000000000000000    LR:  0000000046f01e0c    XER: 0000000000000000
   CCR: 0000000048000484    MQ:  0000000000000001    DAR: 00003fff7bd7e2c8
   DSISR: 0000000040000000     Syscall Result: 0000000000000008

  Initial output when invoking the crash tool on the vmcore:
        KERNEL: /usr/lib/debug/boot/vmlinux-4.10.0-13-generic
      DUMPFILE: /var/crash/201703221034/dump.201703221034  [PARTIAL DUMP]
          CPUS: 7
          DATE: Wed Mar 22 10:34:11 2017
        UPTIME: 00:11:42
  LOAD AVERAGE: 35.29, 25.73, 15.42
         TASKS: 704
      NODENAME: dotg6
       RELEASE: 4.10.0-13-generic
       VERSION: #15-Ubuntu SMP Thu Mar 9 20:27:28 UTC 2017
       MACHINE: ppc64le  (3425 Mhz)
        MEMORY: 6 GB
         PANIC: "Unable to handle kernel paging request for data at address 0x757465736d64329e"
           PID: 27704
       COMMAND: "ppc64_cpu"
          TASK: c000000042a64200  [THREAD_INFO: c00000003423c000]
           CPU: 2
         STATE: TASK_RUNNING (PANIC)
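
  The faulting address can be cross-checked against the register dump. The
  sketch below is a reading of the values quoted above, not a confirmed
  diagnosis: it decodes the bracketed word at NIP (e9290330) as a PowerPC
  DS-form "ld", adds the displacement to GPR09, and compares the result with
  the DAR. It also prints GPR09 as little-endian bytes, on the assumption
  (based on the LE bit in the MSR) that this is how the value sat in memory.
  The helper name decode_ld is made up for illustration.

```python
import struct

# Values copied from the oops and the crash "bt" output above.
GPR09 = 0x757465736D642F6E
DAR = 0x757465736D64329E
FAULTING_INSN = 0xE9290330  # the bracketed word in the instruction dump


def decode_ld(word):
    """Decode a PowerPC DS-form 'ld RT, DS(RA)' (primary opcode 58, XO 0).

    Hypothetical helper for this example; it handles only plain 'ld'.
    """
    opcode = word >> 26
    xo = word & 0x3
    if opcode != 58 or xo != 0:
        raise ValueError("not a plain ld instruction")
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFC          # DS field already shifted left by 2
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    return rt, ra, disp


rt, ra, disp = decode_ld(FAULTING_INSN)
effective_address = (GPR09 + disp) & 0xFFFFFFFFFFFFFFFF
ascii_bytes = struct.pack("<Q", GPR09)  # GPR09 as little-endian memory bytes

print(f"ld r{rt}, {disp:#x}(r{ra})")
print(f"GPR09 + disp = {effective_address:#018x}, DAR = {DAR:#018x}")
print(f"GPR09 as bytes: {ascii_bytes!r}")
```

  If the arithmetic matches, the load was dereferencing GPR09 itself, and
  GPR09 holds printable ASCII rather than a kernel pointer. That would be
  consistent with a structure that was freed and reused, or overwritten with
  string data, before the CPU hotplug callback walked it, though the dump
  alone does not show which.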

  > Can this problem be reproduced with some certainty? If so, I could probably
  > provide a debug patch for the guest kernel and collect some information when
  > this happens.

  This guest appears to have crashed twice now with this error and the
  same backtrace, so the crash will likely recur, but there is no
  specific timeframe for it.

  A test running on this guest periodically turns SMT on and off, and it
  is possible that this SMT test is triggering the crash. Running the
  SMT test more frequently may trigger the crash more consistently.
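
  A minimal sketch of such a tighter SMT toggle loop follows. It is not the
  actual ISST test, whose contents are not shown here. It assumes the
  ppc64_cpu utility from powerpc-utils (the command named in the oops) and
  its --smt=on / --smt=off options. The function name toggle_smt and its
  parameters are made up for illustration, and the command runner is
  injectable so the loop can be dry-run on a machine without ppc64_cpu.

```python
import subprocess
import time


def toggle_smt(cycles, delay_seconds=1.0,
               run=subprocess.check_call, sleep=time.sleep):
    """Switch SMT off and back on `cycles` times; return the commands issued.

    Hypothetical helper: `run` and `sleep` are parameters only so that the
    loop can be exercised without root or a POWER system.
    """
    issued = []
    for _ in range(cycles):
        for state in ("off", "on"):
            cmd = ["ppc64_cpu", f"--smt={state}"]
            run(cmd)               # offlines or onlines the secondary threads
            issued.append(cmd)
            sleep(delay_seconds)   # give the hotplug callbacks time to finish
    return issued
```

  On the guest this would be run as root alongside the IO stress load, for
  example toggle_smt(cycles=1000, delay_seconds=2.0). The cycle count and
  delay are arbitrary starting points, not values taken from the report.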

  Mirroring to Canonical for their awareness while IBM continues
  investigation...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1676678/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
