You have been subscribed to a public bug:

---Problem Description---
EEH is not working with the mlx4 driver. After the driver recovers, it hits another EEH.

---uname output---
Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Requires a Mellanox adapter such as a ConnectX-3.
Machine Type = P8

---Steps to Reproduce---
Inject an EEH error into the mlx4 device.

Stack trace output:
During EEH recovery, the driver hits this:

[ 188.747571] EEH: Collect temporary log
[ 188.748330] EEH: of node=/pci@800000020000007/ethernet@3
[ 188.748339] EEH: PCI device/vendor: 100715b3
[ 188.748361] EEH: PCI cmd/status register: 00100146
[ 188.748362] EEH: PCI-E capabilities and status follow:
[ 188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
[ 188.748537] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 188.748539] EEH: PCI-E 20: 00000000
[ 188.748540] EEH: PCI-E AER capability register set follows:
[ 188.748625] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 188.748704] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 188.748783] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 188.748805] EEH: PCI-E AER 30: 00000000 00000000
[ 188.748813] EEH: Reset without hotplug activity
[ 193.833245] EEH: Notify device drivers the completion of reset
[ 193.833257] mlx4_core: Initializing 0001:00:03.0
[ 193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b0000000-0x170b00fffff]
[ 193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
[ 193.833395] EEH: Not recovered
[ 193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
Please try reseating or replacing it
[ 193.834531] EEH: of node=/pci@800000020000007/ethernet@3
[ 193.834547] EEH: PCI device/vendor: 100715b3
[ 193.834580] EEH: PCI cmd/status register: 00100142
[ 193.834582] EEH: PCI-E capabilities and status follow:
[ 193.834728] EEH: PCI-E 00: 00020010 10008e02 0000200e 0843f483
[ 193.834846] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 193.834849] EEH: PCI-E 20: 00000000
[ 193.834850] EEH: PCI-E AER capability register set follows:
[ 193.834981] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 193.835101] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 193.835219] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 193.835252] EEH: PCI-E AER 30: 00000000 00000000
[ 193.835289] Unable to handle kernel paging request for data at address 0x00000388
[ 193.835356] Faulting instruction address: 0xd000000001f3231c
[ 193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
[ 193.835460] SMP NR_CPUS=2048 NUMA pSeries
[ 193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc rtc_generic mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core
[ 193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
[ 193.835942] task: c0000003f72ca880 ti: c0000003f707c000 task.ti: c0000003f707c000
[ 193.836009] NIP: d000000001f3231c LR: d000000001f32790 CTR: d000000001f32760
[ 193.836076] REGS: c0000003f707f790 TRAP: 0300 Not tainted (3.18.0-12-generic)
[ 193.836141] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44000048 XER: 20000000
[ 193.836302] CFAR: c0000000000a7be0 DAR: 0000000000000388 DSISR: 40000000 SOFTE: 1
GPR00: d000000001f32790 c0000003f707fa10 d000000001f66310 c0000003fe0ad000
GPR04: 0000000000000003 0000000000000000 0000000000000000 c0000003fd000000
GPR08: 0000000000000001 d000000001f32760 00000000fffffffa 0000000100001001
GPR12: d000000001f32760 c00000000fb83600 c0000000000d9118 c0000003f90e56c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c4ab90
GPR24: c000000000c4ab68 0000000000100100 c0000003fe068580 c0000003fe068580
GPR28: c0000003fe0ad000 c0000003fe0685e0 d000000001f5da50 0000000000000000
[ 193.837205] NIP [d000000001f3231c] mlx4_unload_one+0x3c/0x480 [mlx4_core]
[ 193.837269] LR [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837336] Call Trace:
[ 193.837361] [c0000003f707fa10] [c0000003fe068580] 0xc0000003fe068580 (unreliable)
[ 193.837447] [c0000003f707faa0] [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837528] [c0000003f707fae0] [c00000000003ac64] eeh_report_failure+0xb4/0xf0
[ 193.837606] [c0000003f707fb10] [c0000000000393b4] eeh_pe_dev_traverse+0x94/0x160
[ 193.837685] [c0000003f707fba0] [c00000000003b148] eeh_handle_normal_event+0xa8/0x400
[ 193.837764] [c0000003f707fc20] [c00000000003b6b4] eeh_handle_event+0x54/0x360
[ 193.837832] [c0000003f707fcd0] [c00000000003bae4] eeh_event_handler+0x124/0x1d0
[ 193.837911] [c0000003f707fd80] [c0000000000d9220] kthread+0x110/0x130
[ 193.837980] [c0000003f707fe30] [c000000000009568] ret_from_kernel_thread+0x5c/0x74
[ 193.838057] Instruction dump:
[ 193.838094] fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71
[ 193.838217] 7c7c1b78 48000008 e8410018 ebfc0138 <813f0388> 2f890000 409e020c e93f0008
[ 193.838341] ---[ end trace 7cd21329722bcbd1 ]---

The patch series at this link should resolve the issue:
http://permalink.gmane.org/gmane.linux.network/347777

I applied these patches to the upstream kernel and it works, but let me double-check against the Ubuntu 15.04 kernel to confirm that these are the patches we need to resolve this bugzilla.
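As a reading aid for the oops above, the following sketch decodes two words from the "Instruction dump" line and the "PCI device/vendor" register by hand. The helper names (`decode_load`, `split_device_vendor`) are made up for illustration, and only the two load encodings seen in the dump (`lwz` and `ld`) are handled. The decoding suggests that the faulting instruction `813f0388` reads offset 0x388 from r31, which was just loaded from offset 0x138 of the object in r28 and is zero in the register dump (GPR31). That matches the fault address (DAR) of 0x388. Which structure field sits at offset 0x138 is not stated in the log, so treating it as the driver's private data pointer would be an assumption.

```python
def split_device_vendor(reg):
    """Split the 32-bit 'PCI device/vendor' value into (device_id, vendor_id).

    The vendor ID is the low 16 bits of the first PCI config dword,
    the device ID is the high 16 bits.
    """
    return (reg >> 16) & 0xFFFF, reg & 0xFFFF


def decode_load(word):
    """Decode a 32-bit PowerPC load word. Only lwz and ld are handled here."""
    opcode = word >> 26           # primary opcode: top 6 bits
    rt = (word >> 21) & 0x1F      # target register
    ra = (word >> 16) & 0x1F      # base address register
    if opcode == 32:              # lwz, D-form: 16-bit signed displacement
        name, disp = "lwz", word & 0xFFFF
    elif opcode == 58 and (word & 0x3) == 0:
        # ld, DS-form: the low two bits select the ld variant
        name, disp = "ld", word & 0xFFFC
    else:
        raise ValueError(f"unsupported instruction word {word:#010x}")
    if disp & 0x8000:             # sign-extend the displacement
        disp -= 0x10000
    return f"{name} r{rt},{disp:#x}(r{ra})"


if __name__ == "__main__":
    device, vendor = split_device_vendor(0x100715B3)
    print(f"vendor={vendor:#06x} device={device:#06x}")  # 0x15b3 is the Mellanox vendor ID

    print(decode_load(0xEBFC0138))   # instruction just before the fault
    print(decode_load(0x813F0388))   # faulting instruction (marked <...> in the dump)

    gpr31 = 0x0                      # GPR31 value from the register dump
    print(f"effective address = {gpr31 + 0x388:#x}")  # compare with DAR
```

In other words, the second EEH event reaches mlx4_pci_err_detected after the failed re-initialization ("Couldn't get PCI resources, aborting"), and mlx4_unload_one then dereferences a NULL pointer.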
I used this kernel from Ubuntu 15.04: 3.18.0-13.14

To make EEH work, I first have to apply all of the following patches before the first 2 patches of that series will apply:

>From ca9f9f703950e5cb300526549b4f1b0a6605a5c5 Mon Sep 17 00:00:00 2001
From: Amir Vadai <am...@mellanox.com>
Date: Tue, 25 Feb 2014 18:17:52 +0200
Subject: net/mlx4_en: Fix bad use of dev_id

>From adbc7ac5c15eb5e9d70393428345e72a1a897d6a Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <sae...@mellanox.com>
Date: Mon, 27 Oct 2014 11:37:37 +0200
Subject: net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap

>From a53e3e8c1db547981e13d1ebf24a659bd4e87710 Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <sae...@mellanox.com>
Date: Mon, 27 Oct 2014 11:37:38 +0200
Subject: net/mlx4_core: Add ethernet backplane autoneg device capability

>From d475c95b4bcff983ac76e8522bfd2d29bcc567d0 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Sun, 2 Nov 2014 16:26:17 +0200
Subject: net/mlx4_core: Add retrieval of CONFIG_DEV parameters

>From dd65beac48a5259945846956d4b27344dfb73bd9 Mon Sep 17 00:00:00 2001
From: Shani Michaeli <sha...@mellanox.com>
Date: Sun, 9 Nov 2014 13:51:52 +0200
Subject: net/mlx4_en: Extend usage of napi_gro_frags

>From f8c6455bb04b944edb69e9b074e28efee2c56bdd Mon Sep 17 00:00:00 2001
From: Shani Michaeli <sha...@mellanox.com>
Date: Sun, 9 Nov 2014 13:51:53 +0200
Subject: net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

>From ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:29 +0200
Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup

>From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one

>From e8c4265bea8437f5583d0c2f272058200ebc10ff Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:31 +0200
Subject: net/mlx4_core: Add QUERY_FUNC firmware command

>From 7ae0e400cd9396c41fe596d35dcc34feaa89a04f Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:32 +0200
Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs

>From da315679e80635021e98de1306ff4eee0759ba57 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Sun, 14 Dec 2014 16:18:04 +0200
Subject: net/mlx4_core: Fixed memory leak and incorrect refcount in

With those patches in place, I can apply the following from the series I pointed to:

==> 0001-net-mlx4_core-Maintain-a-persistent-memory-for-mlx4-.patch <==
>From 872bf2fb69d90e3619befee842fc26db39d8e475 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:35 +0200
Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device

==> 0002-net-mlx4_core-Set-device-configuration-data-to-be-pe.patch <==
>From dd0eefe3abbf47442db296bf68f27eb2860c1cdf Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:36 +0200
Subject: net/mlx4_core: Set device configuration data to be persistent across reset

==> 0003-net-mlx4_core-Refactor-the-catas-flow-to-work-per-de.patch <==
>From ad9a0bf08ffbf32b8f292c3bb78ca0f24bb8f6b2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:37 +0200
Subject: net/mlx4_core: Refactor the catas flow to work per device

==> 0004-net-mlx4_core-Enhance-the-catas-flow-to-support-devi.patch <==
>From f6bc11e42646e661e699a5593cbd1e9dba7191d0 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:38 +0200
Subject: net/mlx4_core: Enhance the catas flow to support device reset

==> 0005-net-mlx4_core-Activate-reset-flow-upon-fatal-command.patch <==
>From f5aef5aa35063f2b45c3605871cd525d0cb7fb7a Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:39 +0200
Subject: net/mlx4_core: Activate reset flow upon fatal command cases

==> 0006-net-mlx4_core-Manage-interface-state-for-Reset-flow-.patch <==
>From c69453e294c9f16da977b68e658a8028b854c209 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:40 +0200
Subject: net/mlx4_core: Manage interface state for Reset flow cases

==> 0007-net-mlx4_core-Handle-AER-flow-properly.patch <==
>From 2ba5fbd62b2534335f4e3b844ecc7860115525a3 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:41 +0200
Subject: net/mlx4_core: Handle AER flow properly

But to apply the whole series, including SRIOV EEH, I need these extra patches:

==> 0008-g-mlx4.patch <==
>From 225c6c8c6bbbc32455df3d1c0fb1e1e1fb51c533 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:28 +0200
Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap

==> 0008-l-mlx4.patch <==
>From de966c5928026b100a989c8cef761d306310a184 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:33 +0200
Subject: net/mlx4_core: Support more than 64 VFs

==> 0008-m-mlx4.patch <==
>From 383677da43fa83b390888cf7d25885166b2a6812 Mon Sep 17 00:00:00 2001
From: Or Gerlitz <ogerl...@mellanox.com>
Date: Thu, 11 Dec 2014 10:57:52 +0200
Subject: net/mlx4_core: Mask out host side virtualization features for guests

==> 0008-net-mlx4_core-Enable-device-recovery-flow-with-SRIOV.patch <==
>From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV

==> 0008-n-mlx4.patch <==
>From ddae0349fdb78bcc5e7219061847012aa1a29069 Mon Sep 17 00:00:00 2001
From: Eugenia Emantayev <euge...@mellanox.co.il>
Date: Thu, 11 Dec 2014 10:57:54 +0200
Subject: net/mlx4: Change QP allocation scheme

==> 0008-o-mlx4.patch <==
>From 431df8c7e9708433459fd806a08308997de43121 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 11 Dec 2014 10:57:59 +0200
Subject: net/mlx4: Refactor QUERY_PORT

==> 0008-p-mlx4.patch <==
>From ab256e5ad02b36951f01bf6b5cfda25f14820847 Mon Sep 17 00:00:00 2001
From: Dotan Barak <dot...@dev.mellanox.co.il>
Date: Thu, 11 Dec 2014 10:57:55 +0200
Subject: net/mlx4: Add a check if there are too many reserved QPs

==> 0008-r-mlx4.patch <==
>From d57febe1a47801ef8a55dbf10672850523dfaa60 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 11 Dec 2014 10:57:57 +0200
Subject: net/mlx4: Add A0 hybrid steering

==> 0008-s-mlx4.patch <==
>From 7d077cd34eabb2ffd05abe0f2cad01da1ef11712 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 11 Dec 2014 10:58:00 +0200
Subject: net/mlx4: Add support for A0 steering

==> 0008-z-mlx4.patch <==
>From 7a89399ffad7b7c47b43afda010309b3b88538c0 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 11 Dec 2014 10:57:56 +0200
Subject: net/mlx4: Add mlx4_bitmap zone allocator

Then I can apply these:

>From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV

==> 0009-net-mlx4_core-Reset-flow-activation-upon-SRIOV-fatal.patch <==
>From 0cd9302734111abc0b5912b695336f2ee63cb22b Mon Sep 17 00:00:00 2001
From: Yishai Hadas <yish...@mellanox.com>
Date: Sun, 25 Jan 2015 16:59:43 +0200
Subject: net/mlx4_core: Reset flow activation upon SRIOV fatal command cases

So basically, applying the series will require a lot of prerequisite patches and probably retesting the driver.
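With this many prerequisites, it can help to pull the commit IDs out of the patch headers listed above and put them in a consistent order before backporting. The sketch below is one way to do that and is not the procedure used in this report. It assumes the headers are available as text with one field per line, and it orders commits by their `Date:` field, which only approximates the upstream commit order. The names `HEADER_RE`, `parse_patches` and `SAMPLE` are hypothetical, and `SAMPLE` holds three of the headers from this report in shuffled order.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one "git format-patch" style header, with or without the
# mbox ">" escape in front of the leading "From <sha>" line.
HEADER_RE = re.compile(
    r">?From (?P<sha>[0-9a-f]{40}) Mon Sep 17 00:00:00 2001\s+"
    r"From: (?P<author>.+?)\s+"
    r"Date: (?P<date>.+?)\s+"
    r"Subject: (?P<subject>.+)"
)

SAMPLE = """\
>From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one
>From 225c6c8c6bbbc32455df3d1c0fb1e1e1fb51c533 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:28 +0200
Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap
>From ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29 Mon Sep 17 00:00:00 2001
From: Matan Barak <mat...@mellanox.com>
Date: Thu, 13 Nov 2014 14:45:29 +0200
Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup
"""


def parse_patches(text):
    """Return the patch headers found in text, sorted by their Date field."""
    patches = []
    for match in HEADER_RE.finditer(text):
        patches.append({
            "sha": match.group("sha"),
            "date": parsedate_to_datetime(match.group("date")),
            "subject": match.group("subject").strip(),
        })
    return sorted(patches, key=lambda patch: patch["date"])


if __name__ == "__main__":
    for patch in parse_patches(SAMPLE):
        # Print a cherry-pick command per commit; review the order before running.
        print(f"git cherry-pick {patch['sha']}  # {patch['subject']}")
```

The generated commands are only a starting point. Conflicts against the 3.18.0-13.14 tree would still have to be resolved by hand, and the truncated subject of da315679e806 is left as it appears in the report.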
** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: architecture-ppc64le bot-comment bugnameltc-121681 severity-high targetmilestone-inin1504

--
mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)
https://bugs.launchpad.net/bugs/1422481
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp