Hello Mr. Schnelle.

I have reviewed the code and the log, and I think I understand what the bug is.
As far as I understand, the bug is, as you pointed out in the mail [1], the
switched call order of the two functions.
Running mlx5_drain_health_wq() prevents new health work from being queued, so
when we call mlx5_unregister_device() the driver is unaware that the VF might
be missing.
I will start working on a patch to fix this.

[1] https://lkml.org/lkml/2020/6/12/376

On 7/6/2020 19:12, Niklas Schnelle wrote:

Hi Mr. Drory, Hi Netdev List,

I'm the PCI Subsystem maintainer for Linux on IBM Z and since v5.8-rc1
we've been seeing a regression with hot unplug of ConnectX-4 VFs
from z/VM guests. In -rc1 this still looked like a simple issue and
I wrote the following mail:
https://lkml.org/lkml/2020/6/12/376
Sadly, since -rc2 (I think), I've not been able to get this working consistently
anymore (it did work consistently with the change described above on -rc1).
In his answer Saeed Mahameed pointed me to your commits as dealing with
similar issues so I wanted to get some input on how to debug this
further.

The commands I used to test this are as follows (on a z/VM guest running
vanilla debug_defconfig v5.8-rc4 installed on Fedora 31), and you will find the
resulting dmesg attached to this mail:

# vmcp q pcif  // query for available PCI devices
# vmcp attach pcif <FID> to \* // where <FID> is one of the ones listed by the above command
# vmcp detach pcif <FID> // This does a hot unplug and is where things start going wrong

I guess you don't have access to the hardware, but I'll be happy to assist
as well as I can, since from digging on my own I sadly don't know
enough about the mlx5_core driver to make more progress.

Best regards,
Niklas Schnelle

