------- Comment From prad...@us.ibm.com 2018-04-21 01:53 EDT-------
This looks like an Oops similar to the one in comment #39, and it starts a
sequence of events:

root@boslcp3:~# [ 2837.030181] Unable to handle kernel paging request for data 
at address 0x00000008
[ 2837.030253] Faulting instruction address: 0xc0000000001336fc
[ 2837.030295] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2837.030328] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2837.030364] Modules linked in: vhost_net vhost macvtap macvlan tap 
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter 
ebtables devlink ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 
nfsv4 nfs fscache kvm_hv binfmt_misc kvm dm_service_time dm_multipath 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua joydev input_leds idt_89hpesx mac_hid 
vmx_crypto crct10dif_vpmsum at24 ofpart cmdlinepart uio_pdrv_genirq uio 
powernv_flash mtd ibmpowernv ipmi_powernv ipmi_devintf ipmi_msghandler opal_prd 
nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables 
autofs4 btrfs xor zstd_compress raid6_pq ses enclosure hid_generic
[ 2837.030909]  usbhid hid qla2xxx ast i2c_algo_bit ttm ixgbe drm_kms_helper 
mpt3sas nvme_fc syscopyarea sysfillrect nvme_fabrics sysimgblt fb_sys_fops 
nvme_core raid_class crc32c_vpmsum drm i40e scsi_transport_sas 
scsi_transport_fc mdio aacraid
[ 2837.031053] CPU: 145 PID: 1182 Comm: kworker/145:1 Not tainted 
4.15.0-18-generic #19
[ 2837.031107] NIP:  c0000000001336fc LR: c000000000133cf8 CTR: c000000000cfefa0
[ 2837.031156] REGS: c000200e44c77a10 TRAP: 0300   Not tainted  
(4.15.0-18-generic)
[ 2837.031204] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28000822  
XER: 00000000
[ 2837.031257] CFAR: c000000000133cf4 DAR: 0000000000000008 DSISR: 40000000 
SOFTE: 0
[ 2837.031257] GPR00: c000000000133cf8 c000200e44c77c90 c0000000016eae00 
c000200e44bda5c0
[ 2837.031257] GPR04: c000000fdf6f7da0 c000200e618f7da0 c000200e618fa305 
c000000fdf6f7cc8
[ 2837.031257] GPR08: c000200e6190c960 0000000000002440 0000000000000000 
c00800000f04e0f8
[ 2837.031257] GPR12: 0000000000000000 c000000007a83b00 c00000000013c788 
c000200e50ebf3c0
[ 2837.031257] GPR16: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
[ 2837.031257] GPR20: c000200e618f7d80 0000000000000000 0000000000000000 
fffffffffffffef7
[ 2837.031257] GPR24: 0000000000000402 0000000000000000 c000200e618f8100 
c000000001713b00
[ 2837.031257] GPR28: c000200e618f7da0 0000000000000000 c000200e618f7d80 
c000200e44bda5c0
[ 2837.031687] NIP [c0000000001336fc] process_one_work+0x3c/0x5a0
[ 2837.031727] LR [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031760] Call Trace:
[ 2837.031778] [c000200e44c77c90] [c000000000133974] 
process_one_work+0x2b4/0x5a0 (unreliable)
[ 2837.031828] [c000200e44c77d20] [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031885] [c000200e44c77dc0] [c00000000013c928] kthread+0x1a8/0x1b0
[ 2837.031928] [c000200e44c77e30] [c00000000000b528] 
ret_from_kernel_thread+0x5c/0xb4
[ 2837.031976] Instruction dump:
[ 2837.032001] 60000000 7d908026 fba1ffe8 fbc1fff0 91810008 f821ff71 e9240000 
712a0004
[ 2837.032052] 793d05e4 40820008 3ba00000 ebc30048 <e93d0008> 815e0010 81290100 
714a0004
[ 2837.032104] ---[ end trace ae121b1a8fbe89f8 ]---

A cascading series of events follows, ending in hard lockups. However,
those lockups likely happen when IPIs fail, so they are secondary events.
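
For reference, the bracketed word in the instruction dump above,
<e93d0008>, can be decoded by hand to see why the access landed at
address 0x8. The sketch below is only an illustration: the helper names
are hypothetical, and it assumes the word is a DS-form load as laid out
in the Power ISA (primary opcode 58, where XO 0 is ld). It uses GPR29 = 0
from the register dump.

```python
# Hypothetical helpers, written for this comment only.
# Assumption: the faulting word is a DS-form instruction (Power ISA),
# where primary opcode 58 with XO 0 is "ld RT,DS(RA)".

MASK64 = 0xFFFFFFFFFFFFFFFF


def decode_ds_form(word: int) -> dict:
    """Split a 32-bit DS-form instruction word into its fields."""
    disp = word & 0xFFFC          # low 16 bits with the 2 XO bits cleared
    if disp & 0x8000:             # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rt": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "disp": disp,
        "xo": word & 0x3,
    }


def effective_address(fields: dict, gprs: dict) -> int:
    """Compute RA + displacement; RA = 0 means a literal zero base."""
    base = 0 if fields["ra"] == 0 else gprs[fields["ra"]]
    return (base + fields["disp"]) & MASK64


FAULTING_WORD = 0xE93D0008        # the <...> word in the instruction dump
GPRS = {29: 0x0000000000000000}   # GPR29 from the register dump

fields = decode_ds_form(FAULTING_WORD)
ea = effective_address(fields, GPRS)

print(fields)
print(f"effective address: {ea:#018x}")
```

Under those assumptions the instruction is ld r9,8(r29) with r29 = 0,
that is, a load from offset 8 of a NULL pointer, which is consistent with
DAR 0000000000000008. Which structure that pointer was meant to reference
is not something the dump alone establishes.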

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into
  xmon after moving to 4.15.0-15.16 kernel

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Bionic:
  Triaged

Bug description:
  Problem Description:
  ===================
  Host crashed and entered xmon after updating to the 4.15.0-15.16 kernel.

  Steps to re-create:
  ==================

  1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
  2. Installed boslcp3 with latest kernel 
      4.15.0-13-generic 
  3. Enabled "-proposed" kernel in /etc/apt/sources.list file
  4. Ran sudo apt-get update & apt-get upgrade

  5. root@boslcp3:~# ls /boot
  abi-4.15.0-13-generic         retpoline-4.15.0-13-generic
  abi-4.15.0-15-generic         retpoline-4.15.0-15-generic
  config-4.15.0-13-generic      System.map-4.15.0-13-generic
  config-4.15.0-15-generic      System.map-4.15.0-15-generic
  grub                          vmlinux
  initrd.img                    vmlinux-4.15.0-13-generic
  initrd.img-4.15.0-13-generic  vmlinux-4.15.0-15-generic
  initrd.img-4.15.0-15-generic  vmlinux.old
  initrd.img.old

  6. Rebooted & booted with 4.15.0-15 kernel
  7. Enabled xmon by editing /etc/default/grub and running update-grub
  8. Rebooted host.
  9. Booted with 4.15.0-15 and provided root/password credentials at the
  login prompt

  10. Host crashed and entered the xmon state with 'Unable to handle
  kernel paging request'

  root@boslcp3:~# [   66.295233] Unable to handle kernel paging request for 
data at address 0x8882f6ed90e9151a
  [   66.295297] Faulting instruction address: 0xc00000000038a110
  cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
      pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
      lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
      sp: c00000000692f8d0
     msr: 9000000000009033
     dar: 8882f6ed90e9151a
    current = 0xc00000000698fd00
    paca    = 0xc00000000fab7000   softe: 0        irq_happened: 0x01
      pid   = 1762, comm = systemd-journal
  Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 
(Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 
4.15.0-15.16-generic 4.15.15)
  enter ? for help
  [c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 
(unreliable)
  [c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
  [c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c00 (System Call) at 000074826f6fa9c4
  SP (7ffff5dc5510) is in userspace
  50:mon>
  50:mon>

  11. Attached host console logs

  I rebooted the host just to see if it would hit the issue again and
  this time I didn't even get to the login prompt but it crashed in the
  same location:

  50:mon> r
  R00 = c000000000389fd4   R16 = c000200e0b20fdc0
  R01 = c000200e0b20f8d0   R17 = 0000000000000048
  R02 = c0000000016eb400   R18 = 000000000001fe80
  R03 = 0000000000000001   R19 = 0000000000000000
  R04 = 0048ca1cff37803d   R20 = 0000000000000000
  R05 = 0000000000000688   R21 = 0000000000000000
  R06 = 0000000000000001   R22 = 0000000000000048
  R07 = 0000000000000687   R23 = 4882d6e3c8b7ab55
  R08 = 48ca1cff37802b68   R24 = c000200e5851df01
  R09 = 0000000000000000   R25 = 8882f6ed90e67454
  R10 = 0000000000000000   R26 = c000000000b2ec6c
  R11 = c000000000d10f78   R27 = c000000ff901ee00
  R12 = 0000000000002000   R28 = ffffffffffffffff
  R13 = c00000000fab7000   R29 = 00000000015004c0
  R14 = c000200e4c973fc8   R30 = c000200e5851df01
  R15 = c000200e4c974238   R31 = c000000ff901ee00
  pc  = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
  cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
  lr  = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
  msr = 9000000000009033   cr  = 28002844
  ctr = c00000000061e1b0   xer = 0000000000000000   trap =  380
  dar = 8882f6ed90e67454   dsisr = c000200e40bd8400
  50:mon> t
  [c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 
(unreliable)
  [c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
  [c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
  [c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
  [c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
  [c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
  [c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
  [c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
  [c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
  --- Exception: c01 (System Call) at 00007d16a993a940
  SP (7ffffbee2270) is in userspace
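
One thing the two xmon dumps have in common is that the DAR values
(8882f6ed90e9151a and 8882f6ed90e67454) do not look like pointers at all.
As a rough illustration, the sketch below sorts addresses from the logs by
their top four bits. It assumes the usual ppc64 Linux convention that
kernel addresses start at 0xc000000000000000 and user addresses have the
top bits clear; the function name and the three-way split are
simplifications made up for this example.

```python
# Hypothetical helper, written for this report only.
# Assumption: on ppc64 Linux, kernel addresses have a top nibble of 0xc
# and user addresses have a top nibble of 0x0. This is a simplification.


def classify_address(addr: int) -> str:
    """Classify a 64-bit address by its top four bits."""
    top = (addr >> 60) & 0xF
    if top == 0xC:
        return "kernel"
    if top == 0x0:
        return "user"
    return "neither"


# Values copied from the xmon output in this report.
samples = {
    "first crash DAR": 0x8882F6ED90E9151A,
    "second crash DAR": 0x8882F6ED90E67454,
    "first crash sp": 0xC00000000692F8D0,
    "userspace SP": 0x00007FFFF5DC5510,
}

for name, addr in samples.items():
    print(f"{name:18s} {addr:#018x} -> {classify_address(addr)}")
```

Both DARs fall into the "neither" bucket, which fits trap 380 (Data Access
Out of Range) and would be consistent with the allocator following a
corrupted pointer, though the dumps alone do not prove that.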

  Mirroring to Canonical to advise them that this might be a possible
  regression. Didn't see any obvious changes in this area in the
  changelog published at
  https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16 but it would
  be good to have Canonical help reviewing the deltas as we try to
  isolate this further.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1762844/+subscriptions

