------- Comment From abdha...@in.ibm.com 2018-06-21 03:33 EDT-------
Verified on 4.15.0-24-generic: the adapter recovers cleanly after error injection, with no Oops messages.

[ 3473.707228] EEH: PHB#2 failure detected, location: N/A
[ 3473.707308] CPU: 96 PID: 20922 Comm: lspci Not tainted 4.15.0-24-generic #26-Ubuntu
[ 3473.707310] Call Trace:
[ 3473.707321] [c0002038006fbb00] [c000000000ce04bc] dump_stack+0xb0/0xf4 (unreliable)
[ 3473.707328] [c0002038006fbb40] [c00000000003ade4] eeh_dev_check_failure+0x234/0x5b0
[ 3473.707335] [c0002038006fbbe0] [c0000000000adc58] pnv_pci_read_config+0x128/0x160
[ 3473.707340] [c0002038006fbc20] [c00000000075d1ac] pci_user_read_config_dword+0x8c/0x180
[ 3473.707345] [c0002038006fbc70] [c0000000007722f4] pci_read_config+0x104/0x2d0
[ 3473.707350] [c0002038006fbcf0] [c0000000004a05f0] sysfs_kf_bin_read+0x70/0xd0
[ 3473.707354] [c0002038006fbd10] [c00000000049f540] kernfs_fop_read+0xe0/0x290
[ 3473.707358] [c0002038006fbd60] [c0000000003d517c] __vfs_read+0x3c/0x70
[ 3473.707361] [c0002038006fbd80] [c0000000003d526c] vfs_read+0xbc/0x1b0
[ 3473.707364] [c0002038006fbdd0] [c0000000003d5ae4] SyS_pread64+0xc4/0xf0
[ 3473.707369] [c0002038006fbe30] [c00000000000b284] system_call+0x58/0x6c
[ 3473.707381] EEH: Detected error on PHB#2
[ 3473.707384] EEH: This PCI device has failed 8 times in the last hour
[ 3473.707385] EEH: Notify device drivers to shutdown
[ 3473.707402] ixgbe 0002:01:00.0: Adapter removed
[ 3473.730202] ixgbe 0002:01:00.1: Adapter removed
[ 3473.752641] EEH: Collect temporary log
[ 3473.752644] PHB4 PHB#2 Diag-data (Version: 1)
[ 3473.752645] brdgCtl: 00000002
[ 3473.752649] RootSts: 00060040 00402000 c1010008 00100107 00004000
[ 3473.752651] RootErrSts: 00000024 00000020 00000000
[ 3473.752653] sourceId: 01000000
[ 3473.752655] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 3473.752657] PhbSts: 0000001c00000000 0000001c00000000
[ 3473.752659] Lem: 1001000104300100 0000000000000000 1000000000000000
[ 3473.752661] PhbErr: 00000da000000000 0000010000000000 2148000098000240 a008400000000000
[ 3473.752664] PhbTxeErr: 0000000600000000 0000000200000000 0000000000000000 0000000000000000
[ 3473.752666] RxeArbErr: 0000100030000020 0000000000000020 4000010000000000 0000000000000000
[ 3473.752668] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 3473.752670] RegbErr: 00d0000000000000 0010000000000000 4800012c00000000 0000000007000000
[ 3473.752673] PE[000] A/B: a700000300000000 8101000001010000
[ 3473.752677] PE[100] A/B: 8000000000003bfe 80000000300c3de9
[ 3473.752680] EEH: Reset without hotplug activity
[ 3477.113186] EEH: Notify device drivers the completion of reset
[ 3477.113197] ixgbe 0002:01:00.0: enabling device (0140 -> 0142)
[ 3477.174161] ixgbe 0002:01:00.0: pci_cleanup_aer_uncorrect_error_status failed 0xffffffea
[ 3477.174239] ixgbe 0002:01:00.1: enabling device (0140 -> 0142)
[ 3477.238148] ixgbe 0002:01:00.1: pci_cleanup_aer_uncorrect_error_status failed 0xffffffea
[ 3477.238220] EEH: Notify device driver to resume
[ 3477.669705] ixgbe 0002:01:00.0 enP2p1s0f0: detected SFP+: 3
[ 3478.037802] ixgbe 0002:01:00.1 enP2p1s0f1: detected SFP+: 4
[ 3478.337233] ixgbe 0002:01:00.0 enP2p1s0f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 3478.705247] ixgbe 0002:01:00.1 enP2p1s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Thanks, Mauro, for all your support!

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1776389

Title: [Ubuntu 1804][boston][ixgbe] EEH causes kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352 (i2S)

Status in The Ubuntu-power-systems project: Triaged
Status in linux package in Ubuntu: Fix Committed
Status in linux source package in Bionic: Fix Committed

Bug description:

== Comment: #0 - ABDUL HALEEM <> - 2018-02-16 08:26:15 ==

Problem:
------------
Injecting an error multiple times causes a kernel crash.

echo 0x0:1:4:0x6000008000000:0xfff80000 > /sys/kernel/debug/powerpc/PCI0000/err_injct

EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0-PE#0 has failed 6 times in the last hour and has been permanently disabled.
EEH: Unable to recover from failure from PHB#0-PE#0.
Please try reseating or replacing it
ixgbe 0000:01:00.1: Adapter removed
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/drivers/pci/msi.c:352!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache joydev input_leds mac_hid idt_89hpesx ofpart ipmi_powernv cmdlinepart ipmi_devintf ipmi_msghandler at24 powernv_flash mtd opal_prd ibmpowernv uio_pdrv_genirq vmx_crypto uio sch_fq_codel nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas qla2xxx ast hid_generic ttm drm_kms_helper ixgbe syscopyarea usbhid igb sysfillrect sysimgblt nvme_fc fb_sys_fops hid nvme_fabrics crct10dif_vpmsum crc32c_vpmsum drm i40e scsi_transport_fc aacraid i2c_algo_bit mdio
CPU: 28 PID: 972 Comm: eehd Not tainted 4.15.0-10-generic #11-Ubuntu
NIP: c00000000077f080 LR: c00000000077f070 CTR: c0000000000aac30
REGS: c000000ff1deb5a0 TRAP: 0700 Not tainted (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002822 XER: 20040000
CFAR: c00000000018bddc SOFTE: 1
GPR00: c00000000077f070 c000000ff1deb820 c0000000016ea600 c000000fbb5fac00
GPR04: 00000000000002c5 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000fbb5fac00 0000000000000001 c000000fec617a00 c000000fdfd86488
GPR12: 0000000000000040 c000000007a33400 c000000000138be8 c000000ff90ec1c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000f48d10
GPR24: c000000000f48ce8 c000200e4fcf4000 c000000fc6900b18 c000200e4fcf4000
GPR28: c000200e4fcf4288 c008000010624480 0000000000000000 c000000fbb633ea0
NIP [c00000000077f080] free_msi_irqs+0xa0/0x260
LR [c00000000077f070] free_msi_irqs+0x90/0x260
Call Trace:
[c000000ff1deb820] [c00000000077f070] free_msi_irqs+0x90/0x260 (unreliable)
[c000000ff1deb880] [c00000000077fa68] pci_disable_msix+0x128/0x170
[c000000ff1deb8c0] [c00800001060b5c8] ixgbe_reset_interrupt_capability+0x90/0xd0 [ixgbe]
[c000000ff1deb8f0] [c0080000105d52f4] ixgbe_remove+0xec/0x240 [ixgbe]
[c000000ff1deb990] [c0000000007670ec] pci_device_remove+0x6c/0x110
[c000000ff1deb9d0] [c00000000085d194] device_release_driver_internal+0x224/0x310
[c000000ff1deba20] [c00000000075b398] pci_stop_bus_device+0x98/0xe0
[c000000ff1deba60] [c00000000075b588] pci_stop_and_remove_bus_device+0x28/0x40
[c000000ff1deba90] [c00000000005e1d0] pci_hp_remove_devices+0x90/0x130
[c000000ff1debb20] [c00000000005e184] pci_hp_remove_devices+0x44/0x130
[c000000ff1debbb0] [c00000000003ec04] eeh_handle_normal_event+0x134/0x580
[c000000ff1debc60] [c00000000003f160] eeh_handle_event+0x30/0x338
[c000000ff1debd10] [c00000000003f830] eeh_event_handler+0x140/0x200
[c000000ff1debdc0] [c000000000138d88] kthread+0x1a8/0x1b0
[c000000ff1debe30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
Instruction dump:
419effe0 3bc00000 4800000c 60420000 807f0010 7c7e1a14 78630020 4ba0cd3d
60000000 e9430158 312affff 7d295110 <0b090000> 813f0014 395e0001 7d5e07b4
---[ end trace 23c446a470e60864 ]---

ixgbe 0000:01:00.0: Adapter removed
Sending IPI to other CPUs
OPAL: Switch to big-endian OS
OPAL: Switch to little-endian OS
PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

---uname output---
Linux ltciofvtr-bostonlc1 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = Boston-LC

0000:00:00.0 PCI bridge [0604]: IBM Device [1014:04c1]
0000:01:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
0000:01:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

# ethtool -i enp1s0f0
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x800006da
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Userspace tool common name: EEH

== Comment: #6 - Mauro Rodrigues <> - 2018-03-19 11:54:03 ==

Even though it will probably not be accepted as is, I'll send a solution upstream.

Long story short: we add ixgbe_free_irq right before ixgbe_clear_interrupt_scheme in ixgbe_remove.

That creates a side effect. The change targets the hotplug remove path, but with the patch applied, the usual removal path (for instance, an unbind through sysfs) frees the interrupts twice.

To avoid that, I'll send a patch that integrates free_irq into the clear-interrupt-scheme code path.

== Comment: #8 - Mauro Rodrigues <> - 2018-04-18 12:23:34 ==

Waiting for upstream feedback at:
http://patchwork.ozlabs.org/patch/900279/
which reads "ixgbe: Fix free irq process when removing device due to PCI Errors"

== Comment: #9 - Mauro Rodrigues <> - 2018-05-03 11:56:49 ==

The v3 of the patch is going through Intel's queue for further testing:
http://patchwork.ozlabs.org/patch/907695/
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

== Comment: #11 - Mauro Rodrigues <> - 2018-06-11 10:06:35 ==

This got merged into Torvalds' tree last week and I didn't notice before.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/intel/ixgbe?id=b212d815e77c72be921979119c715166cc8987b1
which reads: "ixgbe/ixgbevf: Free IRQ when PCI error recovery removes the device"

I'll submit it to the Canonical ML today.

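As a rough illustration of the double-free concern raised in comment #6, here is a small userspace C model. It is only a sketch under stated assumptions: `struct model_adapter` and every `model_*` function are hypothetical names invented for this example, not ixgbe or PCI core APIs, and the code does not reproduce the merged patch. It also assumes that disabling MSI-X while a handler is still registered is what triggers the BUG in free_msi_irqs shown in the trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_VECTORS 4

/* Hypothetical stand-in for adapter state; not the real struct ixgbe_adapter. */
struct model_adapter {
    bool handler_attached[MODEL_NUM_VECTORS]; /* true while a handler is registered */
    bool msix_enabled;
    int redundant_frees; /* frees attempted on vectors that had no handler */
};

/* Models device open: enable MSI-X and register one handler per vector. */
static void model_open(struct model_adapter *a)
{
    assert(a != NULL);
    a->msix_enabled = true;
    a->redundant_frees = 0;
    for (int i = 0; i < MODEL_NUM_VECTORS; i++)
        a->handler_attached[i] = true;
}

/* Models an unconditional free of every vector's handler. */
static void model_free_irq(struct model_adapter *a)
{
    assert(a != NULL);
    for (int i = 0; i < MODEL_NUM_VECTORS; i++) {
        if (!a->handler_attached[i])
            a->redundant_frees++; /* second free of the same vector */
        a->handler_attached[i] = false;
    }
}

/*
 * Models disabling MSI-X. Returns false where the kernel is assumed to hit
 * its BUG check because a handler is still registered on some vector.
 */
static bool model_disable_msix(struct model_adapter *a)
{
    assert(a != NULL);
    for (int i = 0; i < MODEL_NUM_VECTORS; i++)
        if (a->handler_attached[i])
            return false;
    a->msix_enabled = false;
    return true;
}

/*
 * Integrated variant in the spirit of comment #6: release only the handlers
 * that are still attached, then disable MSI-X. Safe to call in either state.
 */
static bool model_clear_interrupt_scheme(struct model_adapter *a)
{
    assert(a != NULL);
    for (int i = 0; i < MODEL_NUM_VECTORS; i++)
        if (a->handler_attached[i])
            a->handler_attached[i] = false;
    return model_disable_msix(a);
}
```

In this model, calling model_disable_msix right after model_open returns false (the crash case), calling model_free_irq twice leaves redundant_frees equal to MODEL_NUM_VECTORS (the double free on the usual removal path), and model_clear_interrupt_scheme succeeds whether or not the handlers were already released.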
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1776389/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp