You have been subscribed to a public bug:

== Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==
Problem:
------------
Injecting error multiple times causes kernel crash.

echo 0x0:1:4:0x6000008000000:0xfff80000 > /sys/kernel/debug/powerpc/PCI0000/err_injct

EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0-PE#0 has failed 6 times in the
last hour and has been permanently disabled.
EEH: Unable to recover from failure from PHB#0-PE#0.
Please try reseating or replacing it
ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid 
idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 
powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio 
sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm 
ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables 
x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
multipath linear ses enclosure scsi_transport_sas qla2xxx ast hid_generic ttm 
drm_kms_helper ixgbe syscopyarea usbhid igb sysfillrect sysimgblt nvme_fc 
fb_sys_fops hid nvme_fabrics crct10dif_vpmsum crc32c_vpmsum drm i40e 
scsi_transport_fc aacraid i2c_algo_bit mdio
CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
NIP:  c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
REGS: c000000ff1deb5a0 TRAP: 0700   Not tainted  (4.15.0-10-generic)
MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002822  XER: 20040000
CFAR: c00000000018bddc SOFTE: 1
GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
NIP [c00000000077f080] free_msi_irqs+0xa0/0x260
LR [c00000000077f070] free_msi_irqs+0x90/0x260
Call Trace:
[c000000ff1deb820] [c00000000077f070] free_msi_irqs+0x90/0x260 (unreliable)
[c000000ff1deb880] [c00000000077fa68] pci_disable_msix+0x128/0x170
[c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_interrupt_capability+0x90/0xd0 [ixgbe]
[c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+0xec/0x240 [ixgbe]
[c000000ff1deb990] [c0000000007670ec] pci_device_remove+0x6c/0x110
[c000000ff1deb9d0] [c00000000085d194] device_release_driver_internal+0x224/0x310
[c000000ff1deba20] [c00000000075b398] pci_stop_bus_device+0x98/0xe0
[c000000ff1deba60] [c00000000075b588] pci_stop_and_remove_bus_device+0x28/0x40
[c000000ff1deba90] [c00000000005e1d0] pci_hp_remove_devices+0x90/0x130
[c000000ff1debb20] [c00000000005e184] pci_hp_remove_devices+0x44/0x130
[c000000ff1debbb0] [c00000000003ec04] eeh_handle_normal_event+0x134/0x580
[c000000ff1debc60] [c00000000003f160] eeh_handle_event+0x30/0x338
[c000000ff1debd10] [c00000000003f830] eeh_event_handler+0x140/0x200
[c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
[c000000ff1debe30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
Instruction dump:
419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
---[ end trace 23c446a470e60864 ]---
ixgbe 0000:01:00.0: Adapter removed

Sending IPI to other CPUs
OPAL: Switch to big-endian OS
OPAL: Switch to little-endian OS
PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

 
---uname output---
Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = Boston-LC 

0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection [8086:10fb] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit 
SFI/SFP+ Network Connection [8086:10fb] (rev 01)

# ethtool  -i enp1s0f0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800006da
expansion-rom-version: 
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

 Userspace tool common name: EEH

== Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==
Although it probably will not be accepted as is, I'll send a solution 
upstream.

Long story short: we add ixgbe_free_irq right before 
ixgbe_clear_interrupt_scheme in ixgbe_remove.
That has a side effect. The fix targets the hotplug remove path, but with the 
patch applied the usual removal path (for instance, unbind via sysfs) frees 
the interrupt twice.
To avoid that, I'll send a patch that integrates free_irq into the clear 
interrupt scheme code path.
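
The idea in this comment can be illustrated with a small userspace model. This
is only a sketch under stated assumptions, not the driver code and not the
patch that was eventually merged: the struct, the model_* function names, and
the irq_requested flag are all hypothetical. It shows why folding the IRQ
release into the interrupt-scheme teardown, guarded by a flag, lets both the
usual removal path and the EEH removal path release the IRQ exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for adapter state; not the real struct ixgbe_adapter. */
struct model_adapter {
	bool irq_requested;	/* true while IRQ handlers are still registered */
	int free_calls;		/* how many times the IRQs were actually released */
};

/* Release the IRQs once; any later call is a no-op. */
static void model_free_irq(struct model_adapter *a)
{
	if (!a->irq_requested)
		return;
	a->irq_requested = false;
	a->free_calls++;
}

/* Models integrating the IRQ release into the interrupt-scheme teardown. */
static void model_clear_interrupt_scheme(struct model_adapter *a)
{
	model_free_irq(a);
	/*
	 * Assumption: this stands in for the kernel-side check that no
	 * handler is still attached when the MSI vectors are torn down.
	 */
	assert(!a->irq_requested);
}

/* Usual removal (e.g. sysfs unbind): the close path already freed the IRQs. */
static void model_remove_after_close(struct model_adapter *a)
{
	model_free_irq(a);			/* done by the close path */
	model_clear_interrupt_scheme(a);	/* must not free a second time */
}

/* Removal after a permanent EEH failure: no close ran beforehand. */
static void model_remove_after_eeh(struct model_adapter *a)
{
	model_clear_interrupt_scheme(a);
}
```

Calling either removal function on an adapter initialized with
irq_requested = true leaves free_calls at 1, so neither path skips the release
and neither path performs it twice.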

== Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==
Waiting for upstream feedback at:
http://patchwork.ozlabs.org/patch/900279/

which reads "ixgbe: Fix free irq process when removing device due to PCI
Errors"

== Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==
The v3 of the patch is going through Intel's queue for further testing:
http://patchwork.ozlabs.org/patch/907695/
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the 
device"


== Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==
This got merged into Torvalds' tree last week and I didn't notice until now.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel/ixgbe?id=b212d815e77c72be921979119c715166cc8987b1

which reads:
"ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

I'll submit it to the Canonical ML today.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-164762 severity-high 
targetmilestone-inin1804
-- 
[Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at 
/build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)
https://bugs.launchpad.net/bugs/1776389
You received this bug notification because you are a member of Kernel Packages, 
which is subscribed to linux in Ubuntu.
