Hello Mr. Schnelle,

I have reviewed the code and the log, and I think I understand what the bug is. As far as I can tell, it is what you pointed out in your mail [1]: the switched call order of the two functions. Running mlx5_drain_health_wq() first prevents new health work from being queued, so by the time we call mlx5_unregister_device() the driver is unaware that the VF might be missing.

I will start working on a patch to fix this.

[1] https://lkml.org/lkml/2020/6/12/376

On 7/6/2020 19:12, Niklas Schnelle wrote:
Hi Mr. Drory, Hi Netdev List,

I'm the PCI Subsystem maintainer for Linux on IBM Z, and since v5.8-rc1 we've been seeing a regression with hot unplug of ConnectX-4 VFs from z/VM guests. In -rc1 this still looked like a simple issue and I wrote the following mail:

https://lkml.org/lkml/2020/6/12/376

Sadly, since I think -rc2, I've not been able to get this working consistently anymore (it did work consistently with the change described above on -rc1). In his answer, Saeed Mahameed pointed me to your commits as dealing with similar issues, so I wanted to get some input on how to debug this further.

The commands I used to test this are as follows (on a z/VM guest running vanilla debug_defconfig v5.8-rc4 installed on Fedora 31), and you will find the resulting dmesg attached to this mail:

# vmcp q pcif // query for available PCI devices
# vmcp attach pcif <FID> to * // where <FID> is one of the ones listed by the above command
# vmcp detach pcif <FID> // This does a hot unplug and is where things start going wrong

I guess you don't have access to the hardware, but I'll be happy to assist as well as I can, since from digging on my own I sadly don't know enough about the mlx5_core driver to make more progress.

Best regards,
Niklas Schnelle