You have been subscribed to a public bug:

== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-13 01:28:56 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]

---uname output---
Linux ltc-garri2 4.4.0-30-generic #49-Ubuntu SMP Fri Jul 1 10:00:36 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea

Machine Type = P8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Install Ubuntu 16.04.1 on a P8 Open Power 8335-GTB machine. Then execute the Frozen PE error injection tests as shown below:

root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0

Immediately the kernel crashes with an Oops message.

Contact Information = pavsu...@in.ibm.com

Stack trace output:
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010

Oops output:
[ 289.294622] EEH: Frozen PE#0 on PHB#4 detected
[ 289.294785] EEH: PE location: N/A, PHB location: N/A
[ 289.295598] EEH: This PCI device has failed 1 times in the last hour
[ 289.295600] EEH: Notify device drivers to shutdown
[ 289.295605] EEH: Collect temporary log
[ 289.295632] EEH: of node=0004:00:00:0
[ 289.295635] EEH: PCI device/vendor: 03dc1014
[ 289.295638] EEH: PCI cmd/status register: 00100106
[ 289.295641] EEH: Bridge secondary status: 0000
[ 289.295644] EEH: Bridge control: 0002
[ 289.295645] EEH: PCI-E capabilities and status follow:
[ 289.295654] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 289.295661] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 289.295664] EEH: PCI-E 20: 00000000
[ 289.295665] EEH: PCI-E AER capability register set follows:
[ 289.295674] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 289.295680] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 289.295687] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 289.295690] EEH: PCI-E AER 30: 00000000 00000000
[ 289.295693] PHB3 PHB#4 Diag-data (Version: 1)
[ 289.295695] brdgCtl: 00000002
[ 289.295697] UtlSts: 00080000 00000000 00000000
[ 289.295699] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 289.295701] PhbSts: 0000001c00000000 0000001c00000000
[ 289.295704] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 289.295706] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 289.295708] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 289.295711] EEH: Reset with hotplug activity
[ 289.295726] pci_bus 0004:01: busn_res: [bus 01] is released
[ 289.295868] Unable to handle kernel paging request for data at address 0x00000010
[ 289.295937] Faulting instruction address: 0xc000000000083c7c
[ 289.295997] Oops: Kernel access of bad area, sig: 11 [#1]
[ 289.296043] SMP NR_CPUS=2048 NUMA PowerNV
[ 289.296098] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf input_leds joydev mac_hid hid_generic usbhid hid nvidia(POE) opal_prd ofpart cmdlinepart ibmpowernv at24 powernv_flash uio_pdrv_genirq ipmi_powernv mtd ipmi_msghandler powernv_rng uio autofs4 uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 289.296657] CPU: 1 PID: 651 Comm: eehd Tainted: P OE 4.4.0-30-generic #49-Ubuntu
[ 289.296726] task: c000000feeb02a20 ti: c000000feeb88000 task.ti: c000000feeb88000
[ 289.296787] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 289.296848] REGS: c000000feeb8b760 TRAP: 0300 Tainted: P OE (4.4.0-30-generic)
[ 289.296915] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 289.297065] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
GPR00: c000000000083c78 c000000feeb8b9e0 c0000000015b5d00 0000000000000000
GPR04: 0000000000000001 c000000feeb8bac0 c000001e4e693540 0000000000000ff7
GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001c
GPR12: c000000000083c20 c000000007b20980 c0000000000e6318 c000001e4e7a0340
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
GPR28: c00000000161a3f0 0000000000000001 c000001fff764480 c000001e4e744000
[ 289.297867] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 289.297907] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 289.297946] Call Trace:
[ 289.297969] [c000000feeb8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 289.298042] [c000000feeb8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 289.298105] [c000000feeb8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 289.298165] [c000000feeb8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 289.298234] [c000000feeb8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 289.298304] [c000000feeb8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 289.298374] [c000000feeb8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 289.298434] [c000000feeb8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 289.298501] Instruction dump:
[ 289.298531] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 289.298630] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 289.298731] ---[ end trace 393da961db41eff1 ]---
[ 289.452447]

System Dump Info:
The system is not configured to capture a system dump.

*Additional Instructions for pavsu...@in.ibm.com:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

== Comment: #2 - Guo Wen Shan <gws...@au1.ibm.com> - 2016-07-15 09:42:09 ==
The two patches below are needed:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

** Affects: linux (Ubuntu)
     Importance: High
     Assignee: Canonical Kernel Team (canonical-kernel-team)
         Status: Triaged

** Tags: architecture-ppc64le bugnameltc-143706 severity-critical targetmilestone-inin16041

--
[LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Froze PE Error injection
https://bugs.launchpad.net/bugs/1603449
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp