There appears to be a reference count bug in mlx5 device on hotplug remove. This is reproducible in Debian testing (5.7.0-2-amd64) which is a recent kernel.
Environment Windows Server 2019 (Hyper-V) with Linux guest. Remove the SR-IOV option from virtual NIC using GUI and hit apply. This causes hotplug removal of PCI device in guest. Kernel gets reference count error. [ 532.333069] hv_netvsc 4179c815-5d8a-4915-976e-9ea2378e382b eth1: Data path switched to VF: enP44703s2np0 [ 532.363165] device: 'mlx5_0': device_add [ 532.370508] PM: Adding info for No Bus:mlx5_0 [ 532.379786] device: 'uverbs0': device_add [ 532.382409] PM: Adding info for No Bus:uverbs0 [ 532.510182] infiniband mlx5_0: renaming to rdmaP44703p0s2 [ 532.547190] device class 'infiniband_cm': registering [ 532.558788] Loading iSCSI transport class v2.0-870. [ 532.567164] device class 'iscsi_transport': registering [ 532.570989] device class 'iscsi_endpoint': registering [ 532.575108] device class 'iscsi_iface': registering [ 532.578727] device class 'iscsi_host': registering [ 532.582560] device class 'iscsi_connection': registering [ 532.582702] device class 'infiniband_mad': registering [ 532.586392] device class 'iscsi_session': registering [ 532.593841] bus: 'iscsi_flashnode': registered [ 532.611082] device: 'iser': device_add [ 532.614577] PM: Adding info for No Bus:iser [ 532.623155] iscsi: registered transport (iser) [ 532.629388] device: 'rdma_cm': device_add [ 532.635064] PM: Adding info for No Bus:rdma_cm [ 532.659457] RPC: Registered named UNIX socket transport module. [ 532.667507] RPC: Registered udp transport module. [ 532.671664] RPC: Registered tcp transport module. [ 532.676335] RPC: Registered tcp NFSv4.1 backchannel transport module. [ 532.695796] RPC: Registered rdma transport module. [ 532.699960] RPC: Registered rdma backchannel transport module. [ 1264.311748] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised. [ 1484.876854] PM: Removing info for No Bus:uverbs0 [ 1484.972841] PM: Removing info for No Bus:rdmaP44703p0s2 [ 1485.730637] hv_netvsc 4179c815-5d8a-4915-976e-9ea2378e382b eth1: Data path switched from VF: enP44703s2np0 [ 1485.738086] hv_netvsc 4179c815-5d8a-4915-976e-9ea2378e382b eth1: VF unregistering: enP44703s2np0 [ 1485.745467] PM: Removing info for No Bus:enP44703s2np0 [ 1486.149250] mlx5_core ae9f:00:02.0: mlx5_destroy_flow_group:2089:(pid 990): Flow group 19 wasn't destroyed, refcount > 1 [ 1486.156135] mlx5_core ae9f:00:02.0: mlx5_destroy_flow_table:2078:(pid 990): Flow table 1048581 wasn't destroyed, refcount > 1 [ 1486.186055] mlx5_core ae9f:00:02.0: mlx5_cmd_check:753:(pid 990): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x389e56) [ 1486.197990] mlx5_core ae9f:00:02.0: del_hw_flow_table:454:(pid 990): flow steering can't destroy ft [ 1486.299161] ------------[ cut here ]------------ [ 1486.304071] refcount_t: underflow; use-after-free. [ 1486.309209] WARNING: CPU: 1 PID: 990 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0 [ 1486.313184] Modules linked in: rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_umad ib_ipoib scsi_transport_iscsi configfs ib_cm mlx5_ib ib_uverbs ib_core mlx5_core mlxfw pci_hyperv pci_hyperv_intf act_mirred cls_flower sch_ingress sch_multiq tun intel_rapl_msr intel_rapl_common sb_edac ghash_clmulni_intel aesni_intel libaes nls_ascii crypto_simd nls_cp437 cryptd glue_helper vfat rapl serio_raw fat efi_pstore hv_utils ptp evdev hyperv_fb hyperv_keyboard hv_balloon pps_core sg efivars pcspkr joydev drm efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod sr_mod t10_pi crc_t10dif cdrom crct10dif_generic hv_netvsc hid_generic hid_hyperv hv_storvsc scsi_transport_fc hid scsi_mod crct10dif_pclmul crct10dif_common crc32_pclmul hv_vmbus crc32c_intel [ 1486.313184] CPU: 1 PID: 990 Comm: kworker/u8:0 Not tainted 5.7.0-2-amd64 #1 Debian 5.7.10-1 [ 1486.313184] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 01/30/2019 [ 1486.313184] Workqueue: hv_pci_ae9f hv_eject_device_work [pci_hyperv] [ 1486.313184] RIP: 0010:refcount_warn_saturate+0xa6/0xf0 [ 1486.313184] Code: 05 af 73 ce 00 01 e8 1b 6d c5 ff 0f 0b c3 80 3d 9d 73 ce 00 00 75 95 48 c7 c7 70 34 d2 a1 c6 05 8d 73 ce 00 01 e8 fc 6c c5 ff <0f> 0b c3 80 3d 7c 73 ce 00 00 0f 85 72 ff ff ff 48 c7 c7 c8 34 d2 [ 1486.313184] RSP: 0018:ffffbeb9406e3c28 EFLAGS: 00010282 [ 1486.313184] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1486.313184] RDX: ffff96c283ca9900 RSI: ffff96c283c99ac8 RDI: ffff96c283c99ac8 [ 1486.313184] RBP: ffff96c2686586d8 R08: 000000000000065a R09: ffffbeb945baf01c [ 1486.313184] R10: 0000000000aaaaaa R11: 00000000ff000000 R12: ffff96c26e24e370 [ 1486.313184] R13: ffff96c26e24e3d0 R14: ffff96c26e55ea80 R15: ffff96c1b6010890 [ 1486.313184] FS: 0000000000000000(0000) GS:ffff96c283c80000(0000) knlGS:0000000000000000 [ 1486.313184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1486.313184] CR2: 00005651a0346e58 CR3: 00000000ecd90003 CR4: 00000000003606e0 [ 1486.313184] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1486.313184] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1486.313184] Call Trace: [ 1486.313184] tree_put_node+0x115/0x150 [mlx5_core] [ 1486.313184] clean_tree+0x48/0xd0 [mlx5_core] [ 1486.313184] clean_tree+0x48/0xd0 [mlx5_core] [ 1486.313184] clean_tree+0x48/0xd0 [mlx5_core] [ 1486.313184] clean_tree+0x58/0xd0 [mlx5_core] [ 1486.313184] clean_tree+0x48/0xd0 [mlx5_core] [ 1486.313184] clean_tree+0x58/0xd0 [mlx5_core] [ 1486.313184] mlx5_cleanup_fs+0x26/0x160 [mlx5_core] [ 1486.313184] mlx5_unload+0x1e/0x80 [mlx5_core] [ 1486.313184] mlx5_unload_one+0x49/0xb0 [mlx5_core] [ 1486.313184] remove_one+0x3f/0x70 [mlx5_core] [ 1486.313184] pci_device_remove+0x3b/0xa0 [ 1486.313184] device_release_driver_internal+0xe4/0x1c0 [ 1486.313184] pci_stop_bus_device+0x68/0x90 [ 1486.313184] pci_stop_and_remove_bus_device+0xe/0x20 [ 1486.313184] hv_eject_device_work+0x6c/0x170 [pci_hyperv] [ 1486.313184] ? __schedule+0x2e2/0x770 [ 1486.313184] process_one_work+0x1b4/0x380 [ 1486.313184] worker_thread+0x50/0x3c0 [ 1486.313184] kthread+0xf9/0x130 [ 1486.313184] ? process_one_work+0x380/0x380 [ 1486.313184] ? kthread_park+0x90/0x90 [ 1486.313184] ret_from_fork+0x35/0x40 [ 1486.313184] ---[ end trace 36bebf916c9c3f8e ]--- [ 1486.526006] mlx5_core ae9f:00:02.0: mlx5_cmd_check:753:(pid 990): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) [ 1486.534741] mlx5_core ae9f:00:02.0: del_hw_flow_group:584:(pid 990): flow steering can't destroy fg 19 of ft 1048581 [ 1486.544489] mlx5_core ae9f:00:02.0: mlx5_cmd_check:753:(pid 990): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x66d6c3) [ 1486.555257] mlx5_core ae9f:00:02.0: del_hw_flow_table:454:(pid 990): flow steering can't destroy ft [ 1486.562116] mlx5_core ae9f:00:02.0: mlx5_cmd_check:753:(pid 990): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x389e56) [ 1486.573629] mlx5_core ae9f:00:02.0: del_hw_flow_table:454:(pid 990): flow steering can't destroy ft [ 1487.781955] PM: Removing info for No Bus:ptp1 [ 1488.218522] bus: 'pci': remove device ae9f:00:02.0 [ 1488.223662] PM: Removing info for pci:ae9f:00:02.0 [ 1488.230235] hv_netvsc 4179c815-5d8a-4915-976e-9ea2378e382b eth1: VF slot 2 removed [ 1488.230243] device: '2383f0f2-ae9f-4ffb-be78-a687e08f3d1d': device_unregister [ 1488.247874] bus: 'vmbus': remove device 2383f0f2-ae9f-4ffb-be78-a687e08f3d1d [ 1488.264833] device: 'ae9f:00': device_unregister [ 1488.270150] PM: Removing info for No Bus:ae9f:00 [ 1488.275894] device: 'pciae9f:00': device_unregister [ 1488.281550] PM: Removing info for No Bus:pciae9f:00 [ 1488.287609] PM: Removing info for vmbus:2383f0f2-ae9f-4ffb-be78-a687e08f3d1d