** Description changed:

+ [ Impact ]
+
+ s390/pci: Don't abort recovery for user-space drivers
+
+ When a PCI device under the control of a vfio-pci based user-space driver encounters a PCI error event, the subsequent error recovery flow in the kernel is aborted because the vfio-pci driver only implements the error_detected PCI error handler callback. This leaves the PCI device in the error state, requiring unbinding/re-binding of the driver to get it operational again instead of only having to re-init the user-space driver.
+
+ According to the kernel documentation, implementing only the error_detected() callback from the error handling operations should be enough for minimal recovery support. Contrary to this, s390 so far also required the slot_reset() and resume() callbacks to be implemented; otherwise recovery would be aborted.
+
+ Remove the requirement for the additional operations, bringing s390 in line with the AER and EEH error recovery flows.
+
+ [ Fix ]
+
+ Backport the following commit from upstream:
+ - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support
+
+ [ Test Plan ]
+
+ Bind a PCI device to vfio-pci.
+ Start a user-space workload using the device.
+ Use the s390 PCI error injection interface to trigger a recoverable PCI error.
+ Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s error_detected() callback is invoked and recovery proceeds without abort.
+ After recovery, check that the device is functional again in the guest or user-space application without requiring manual unbind/rebind.
+
+ [ Regression Potential ]
+
+ The fix affects how the s390 PCI error handler interprets missing callbacks and the PCI_ERS_RESULT_NONE return code.
+ A bug here could cause the recovery flow to proceed when it should have aborted, or to treat driver abstention as successful recovery even in faulty situations.
+ Users may see PCI devices reported as recovered but remaining non-functional, recovery loops that repeatedly attempt to re-enable or reset devices, or devices silently failing I/O without triggering the expected operator intervention.
+
+ ---
+
  Description: s390/pci: Don't abort recovery for user-space drivers

- Symptom:
+ Symptom:
  When a PCI device under the control of a vfio-pci based user-space driver encounters a PCI error event, the subsequent error recovery flow in the kernel is aborted because the vfio-pci driver only implements the error_detected PCI error handler callback. This leaves the PCI device in the error state, requiring unbinding/re-binding of the driver to get it operational again instead of only having to re-init the user-space driver.

- Problem:
+ Problem:
  According to the kernel documentation, implementing only the error_detected() callback from the error handling operations should be enough for minimal recovery support. Contrary to this, s390 so far also required the slot_reset() and resume() callbacks to be implemented; otherwise recovery would be aborted.

- Solution:
+ Solution:
  Remove the requirement for the additional operations, bringing s390 in line with the AER and EEH error recovery flows.

- Reproduction:
+ Reproduction:
  The problem can be reproduced with any user-space PCI driver, such as the NVMe user-space driver built into QEMU.

- Required Fix / Upstream-ID:
+ Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a
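
For readers who want the before/after behaviour at a glance, the difference described under Problem and Solution can be sketched as a small user-space model. This is not kernel code and not the upstream patch: the class name, the two policy functions, and the example handler tables are all hypothetical, and only the callback names (error_detected, slot_reset, resume) come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrHandlers:
    """Toy stand-in for a driver's PCI error handler table.

    A field set to None means the driver does not implement that callback.
    Hypothetical model for illustration only.
    """
    error_detected: Optional[Callable[[], None]] = None
    slot_reset: Optional[Callable[[], None]] = None
    resume: Optional[Callable[[], None]] = None


def strict_policy_allows_recovery(h: ErrHandlers) -> bool:
    """Models the stricter s390 behaviour before the fix (assumed shape):
    all three callbacks must be present, otherwise recovery is aborted."""
    return all(cb is not None for cb in (h.error_detected, h.slot_reset, h.resume))


def minimal_policy_allows_recovery(h: ErrHandlers) -> bool:
    """Models the behaviour after the fix (assumed shape):
    error_detected() alone is enough to attempt recovery."""
    return h.error_detected is not None


def _noop() -> None:
    """Placeholder callback body; a real driver would do work here."""


# A vfio-pci-like driver: only error_detected() is implemented.
vfio_like = ErrHandlers(error_detected=_noop)
# A driver with all three callbacks.
full_driver = ErrHandlers(error_detected=_noop, slot_reset=_noop, resume=_noop)
# A driver with no error handlers at all.
no_handlers = ErrHandlers()

if __name__ == "__main__":
    for name, h in [("vfio_like", vfio_like),
                    ("full_driver", full_driver),
                    ("no_handlers", no_handlers)]:
        print(name,
              "strict:", strict_policy_allows_recovery(h),
              "minimal:", minimal_policy_allows_recovery(h))
```

In this model only the vfio-pci-like table changes outcome between the two policies, which matches the symptom reported above; the real kernel flow additionally depends on the values the callbacks return, which the sketch does not attempt to model.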
--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2121150

Title:
  [UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2121150/+subscriptions

--
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
