I checked with Matthew and found that he had applied only the first patch [1]. After I also applied the second patch [2], I no longer see any crash or memory corruption in Matthew's VM.

BTW, the Windows Server 2019 host running Matthew's VM doesn't handle NIC SR-IOV correctly: when SR-IOV is enabled, the host offers an Intel VF NIC to the VM, then immediately removes/rescinds the VF (this causes hv_pci_probe() to fail, which triggers the bug on its error handling path), and never re-offers the VF. In other words, NIC SR-IOV doesn't work on this host, but that's a host bug and the host team needs to investigate it.

[0] https://lists.ubuntu.com/archives/kernel-team/2022-May/130378.html
[1] https://lists.ubuntu.com/archives/kernel-team/2022-May/130379.html
[2] https://lists.ubuntu.com/archives/kernel-team/2022-May/130380.html

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1973758

Title:
Azure: Mellanox VF NIC crashes when removed

Status in linux-azure package in Ubuntu:
Invalid
Status in linux-azure source package in Focal:
In Progress

Bug description:
SRU Justification

[Impact]

The 5.4.0-1075-azure and newer kernels are broken in that the VM can easily panic when the Mellanox VF NIC is removed and added due to Azure host servicing events or the below manual "unbind/bind" test (here the GUID can be different in different VMs):

for i in `seq 1 1000`; do
    cd /sys/bus/vmbus/drivers/hv_pci
    echo abdc2107-402e-4704-8c88-c2b850696c3c > unbind
    echo abdc2107-402e-4704-8c88-c2b850696c3c > bind
done

A sample panic call-trace is:

[ 107.359954] kernel BUG at /build/linux-azure-5.4-4I3kFs/linux-azure-5.4-5.4.0/mm/slub.c:4020!
[ 107.363858] invalid opcode: 0000 [#1] SMP NOPTI
[ 107.365870] CPU: 0 PID: 334 Comm: kworker/0:2 Not tainted 5.4.0-1077-azure #80~18.04.1-Ubuntu
[ 107.369589] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 107.373811] Workqueue: events vmbus_onmessage_work
[ 107.375909] RIP: 0010:kfree+0x1d2/0x240
…
[ 107.413789] Call Trace:
[ 107.414867] kobject_uevent_env+0x1b5/0x7e0
[ 107.416747] kobject_uevent+0xb/0x10
[ 107.418327] device_release_driver_internal+0x191/0x1c0
[ 107.420653] device_release_driver+0x12/0x20
[ 107.422523] bus_remove_device+0xe1/0x150
[ 107.424279] device_del+0x167/0x380
[ 107.425824] device_unregister+0x1a/0x60
[ 107.427536] vmbus_device_unregister+0x27/0x50
[ 107.429528] vmbus_onoffer_rescind+0x1d0/0x1f0
[ 107.431474] vmbus_onmessage+0x2c/0x70
[ 107.433104] vmbus_onmessage_work+0x22/0x30
[ 107.434919] process_one_work+0x209/0x400
[ 107.436661] worker_thread+0x34/0x40

It turns out there is a bug in
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/commit/?id=16a3c750a78d8,
which misses the second hunk of the upstream patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=877b911a5ba0.

Please apply the below patch to fix the issue:

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3653,7 +3653,7 @@ static int hv_pci_remove(struct hv_device *hdev)
 	hv_put_dom_num(hbus->bridge->domain_nr);
-	free_page((unsigned long)hbus);
+	kfree(hbus);
 	return ret;
 }

BTW, please apply this patch as well (Note: this patch is not really required, as it only affects the error handling path, which is rarely taken):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42c3d41832ef4fcf60aaa6f748de01ad99572adf

[Test Case]

Microsoft tested
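
For anyone re-running the manual "unbind/bind" test from the bug description, here is a minimal sketch of that loop as a reusable shell function. The function name rebind_loop and its three arguments are hypothetical and not part of the original report; only the hv_pci sysfs directory and the step of writing the GUID to "unbind" and then "bind" come from the description. Unlike the one-liner, it stops at the first failed write.

```shell
#!/bin/sh
# rebind_loop DRIVER_DIR DEVICE_GUID COUNT
# Hypothetical helper (not from the original report): repeatedly unbinds and
# rebinds one device from a driver through the driver's sysfs directory.
rebind_loop() {
    drv_dir=$1   # e.g. /sys/bus/vmbus/drivers/hv_pci
    guid=$2      # device GUID; differs between VMs
    count=$3     # number of unbind/bind cycles

    i=1
    while [ "$i" -le "$count" ]; do
        # Writing the GUID to "unbind" detaches the device from the driver.
        echo "$guid" > "$drv_dir/unbind" || return 1
        # Writing it to "bind" attaches the device again.
        echo "$guid" > "$drv_dir/bind" || return 1
        i=$((i + 1))
    done
}
```

On an affected VM this would presumably be run as root, for example as rebind_loop /sys/bus/vmbus/drivers/hv_pci abdc2107-402e-4704-8c88-c2b850696c3c 1000, substituting the GUID of the VF device in that VM.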