------- Comment From gws...@au1.ibm.com 2016-07-18 21:21 EDT-------
Yes, only one patch needs to be backported, and it should fix the kernel
crash. The patch has been backported to Ubuntu-4.4.0-31.50 and is attached.
Note that I checked out the base kernel code from the git repo below:

git://kernel.ubuntu.com/ubuntu/ubuntu-xenial.git     (branch: master)

The other patch (linked below) can't be backported to Ubuntu 4.4.0 yet,
because the fix depends on EEH support for SR-IOV, which isn't there. Let's
backport it when needed.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1603449

Title:
  [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad
  area, sig: 11 [#1] while executing Froze PE Error injection

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  In Progress

Bug description:
  == Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-13 01:28:56 ==
  ---Problem Description---
  Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]
   
  ---uname output---
  Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  root@ltc-garri2:~# lspci
  0000:00:00.0 PCI bridge: IBM Device 03dc
  0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  0001:00:00.0 PCI bridge: IBM Device 03dc
  0002:00:00.0 PCI bridge: IBM Device 03dc
  0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
  0003:00:00.0 PCI bridge: IBM Device 03dc
  0004:00:00.0 PCI bridge: IBM Device 03dc
  0005:00:00.0 PCI bridge: IBM Device 03dc
  0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
  0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
  0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
  0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
  0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
  0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
  0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
  0006:00:00.0 PCI bridge: IBM Device 03dc
  0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
  0007:00:00.0 PCI bridge: IBM Device 03dc
  0008:00:00.0 Bridge: IBM Device 04ea
  0008:00:00.1 Bridge: IBM Device 04ea
  0008:00:01.0 Bridge: IBM Device 04ea
  0008:00:01.1 Bridge: IBM Device 04ea
  0009:00:00.0 Bridge: IBM Device 04ea
  0009:00:00.1 Bridge: IBM Device 04ea
  0009:00:01.0 Bridge: IBM Device 04ea
  0009:00:01.1 Bridge: IBM Device 04ea
   

   
  Machine Type = P8 
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
  Install Ubuntu 16.04.1 on a P8 Open Power 8335-GTB machine.
  Then execute the Frozen PE error injection test as shown below:

  root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
  0004:00:00.0 PCI bridge: IBM Device 03dc
  root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
  eeh_slot_resets=0
  root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
  0004:00:00.0 0604: 1014:03dc
  0

  The kernel immediately crashes with an Oops message.
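
  The counter check in the steps above could be scripted so that the value
  before and after an injection can be compared. This is only a minimal
  sketch, assuming /proc/powerpc/eeh keeps the `eeh_slot_resets=N` form shown
  in the transcript; the helper name `parse_eeh_slot_resets` is hypothetical
  and not part of any existing tool.

```python
def parse_eeh_slot_resets(proc_text):
    """Return the eeh_slot_resets counter found in /proc/powerpc/eeh text.

    Assumes the counter appears as a `eeh_slot_resets=N` line, as in the
    transcript above. Returns None when no such line is present.
    """
    # Scan from the bottom, since the transcript reads the counter with tail -n 1.
    for line in reversed(proc_text.strip().splitlines()):
        key, sep, value = line.strip().partition("=")
        if sep and key == "eeh_slot_resets":
            return int(value)
    return None


if __name__ == "__main__":
    # Sample text copied from the transcript; on a PowerNV machine this would
    # be the content of /proc/powerpc/eeh instead.
    sample = "eeh_slot_resets=0\n"
    print(parse_eeh_slot_resets(sample))
```

  In this report the counter cannot be re-read after the injection, because
  the machine crashes first.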
   
  Contact Information = pavsu...@in.ibm.com 
   
  Stack trace output:
  [  289.297946] Call Trace:
  [  289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
  [  289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
  [  289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
  [  289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
  [  289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
  [  289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
  [  289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
  [  289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
  [  289.298501] Instruction dump:
  [  289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
  [  289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010

   
  Oops output:
   [  289.294622] EEH: Frozen PE#0 on PHB#4 detected
  [  289.294785] EEH: PE location: N/A, PHB location: N/A
  [  289.295598] EEH: This PCI device has failed 1 times in the last hour
  [  289.295600] EEH: Notify device drivers to shutdown
  [  289.295605] EEH: Collect temporary log
  [  289.295632] EEH: of node=0004:00:00:0
  [  289.295635] EEH: PCI device/vendor: 03dc1014
  [  289.295638] EEH: PCI cmd/status register: 00100106
  [  289.295641] EEH: Bridge secondary status: 0000
  [  289.295644] EEH: Bridge control: 0002
  [  289.295645] EEH: PCI-E capabilities and status follow:
  [  289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
  [  289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
  [  289.295664] EEH: PCI-E 20: 00000000
  [  289.295665] EEH: PCI-E AER capability register set follows:
  [  289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
  [  289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
  [  289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
  [  289.295690] EEH: PCI-E AER 30: 00000000 00000000
  [  289.295693] PHB3 PHB#4 Diag-data (Version: 1)
  [  289.295695] brdgCtl:     00000002
  [  289.295697] UtlSts:      00080000 00000000 00000000
  [  289.295699] RootSts:     00000040 00000000 01010008 00100102 00000000
  [  289.295701] PhbSts:      0000001c00000000 0000001c00000000
  [  289.295704] Lem:         0000000000100000 42498e367f502eae 0000000000000000
  [  289.295706] InAErr:      4000000000000000 4000000000000000 0202000000000000 0000000000000000
  [  289.295708] PE[  0] A/B: 8440002b00000000 8000000000000000
  [  289.295711] EEH: Reset with hotplug activity
  [  289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
  [  289.295868] Unable to handle kernel paging request for data at address 0x00000010
  [  289.295937] Faulting instruction address: 0xc000000000083c7c
  [  289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
  [  289.296043] SMP NR_CPUS=2048 NUMA PowerNV
  [  289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
  [  289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P           OE   4.4.0-30-generic #49-Ubuntu
  [  289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
  [  289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
  [  289.296848] REGS: c000000feeb8b760 TRAP: 0300   Tainted: P           OE    (4.4.0-30-generic)
  [  289.296915] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28008822  XER: 00000000
  [  289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
                 GPR00: c000000000083c78 c000000feeb8b9e0 c0000000015b5d00 0000000000000000
                 GPR04: 0000000000000001 c000000feeb8bac0 c000001e4e693540 0000000000000ff7
                 GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001c
                 GPR12: c000000000083c20 c000000007b20980 c0000000000e6318 c000001e4e7a0340
                 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
                 GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
                 GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
                 GPR28: c00000000161a3f0 0000000000000001 c000001fff764480 c000001e4e744000
  [  289.297867] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
  [  289.297907] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
  [  289.297946] Call Trace:
  [  289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
  [  289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
  [  289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
  [  289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
  [  289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
  [  289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
  [  289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
  [  289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
  [  289.298501] Instruction dump:
  [  289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
  [  289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
  [  289.298731] ---[ end trace 393da961db41eff1 ]---
  [  289.452447]
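
  The oops above can be cross-checked by decoding the faulting instruction by
  hand. The following is a rough sketch, assuming that the bracketed word
  <e9230010> in the instruction dump is the instruction at NIP and that the
  GPR rows list r0..r31 four per row (so r3 is the fourth value on the GPR00
  row). The function names are hypothetical, and only the DS-form loads under
  primary opcode 58 are handled.

```python
MASK64 = (1 << 64) - 1
MNEMONICS = {0: "ld", 1: "ldu", 2: "lwa"}  # extended opcodes under primary opcode 58


def decode_ds_form(word):
    """Split a 32-bit PowerPC DS-form load word into (mnemonic, rt, ra, disp)."""
    opcode = (word >> 26) & 0x3F
    xo = word & 0x3
    if opcode != 58 or xo not in MNEMONICS:
        raise ValueError("not a DS-form load (primary opcode 58)")
    rt = (word >> 21) & 0x1F      # target register
    ra = (word >> 16) & 0x1F      # base register
    disp = word & 0xFFFC          # DS field with the two low bits cleared
    if disp & 0x8000:             # sign-extend the 16-bit byte displacement
        disp -= 0x10000
    return MNEMONICS[xo], rt, ra, disp


def effective_address(word, gprs):
    """Compute the address the load touches, given a {register: value} map."""
    _, _, ra, disp = decode_ds_form(word)
    base = gprs[ra] if ra != 0 else 0   # RA=0 means a literal zero base for ld/lwa
    return (base + disp) & MASK64


if __name__ == "__main__":
    word = 0xE9230010                   # the <e9230010> word from the instruction dump
    gprs = {3: 0x0}                     # r3 as read from the GPR00 row of the oops
    mnemonic, rt, ra, disp = decode_ds_form(word)
    print(f"{mnemonic} r{rt},{disp}(r{ra})")
    print(hex(effective_address(word, gprs)))
```

  Under those assumptions the word decodes to `ld r9,16(r3)` with r3 equal to
  zero, giving an effective address of 0x10. That agrees with the DAR value
  and the "data at address 0x00000010" message, and is consistent with a load
  through a NULL pointer at offset 0x10 inside pnv_eeh_reset().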

   
  System Dump Info:
    The system is not configured to capture a system dump.
   
  *Additional Instructions for pavsu...@in.ibm.com: 
  -Post a private note with access information to the machine that the bug is occurring on.
  -Attach sysctl -a output to the bug.

  == Comment: #2 - Guo Wen Shan <gws...@au1.ibm.com> - 2016-07-15 09:42:09 ==
  The following two patches are needed:

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
  ("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
  ("powerpc/eeh: Fix invalid cached PE primary bus")

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1603449/+subscriptions
