** Description changed:

+ [ Impact ]
+ 
+ s390/pci: Don't abort recovery for user-space drivers
+ 
+ When a PCI device under the control of a vfio-pci based user-space
+ driver encounters a PCI error event the subsequent error recovery flow
+ in the kernel is aborted because the vfio-pci driver only implements the
+ error_detected PCI error handler callback. This leaves the PCI device in
+ the error state requiring unbinding/re-binding of the driver to get it
+ operational again instead of only having to re-init the user-space
+ driver.
+ 
+ According to the kernel documentation implementing only the
+ error_detected() callback from the error handling operations should be
+ enough for minimal recovery support. Contrary to this s390 so far
+ required also the reset_slot() and resume() callbacks to be implemented,
+ otherwise recovery would be aborted.
+ 
+ Remove the requirement for the additional operations bringing s390 in
+ line with AER and EEH error recovery flows.
+ 
+ [ Fix ]
+ 
+ Backport the following commit from upstream:
+ - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support
+ 
+ [ Test Plan ]
+ 
+ Bind a PCI device to vfio-pci.
+ Start a user-space workload using the device.
+ Use the s390 PCI error injection interface to trigger a recoverable PCI error.
+ Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s error_detected() callback is invoked and recovery proceeds without abort.
+ After recovery, check that the device is functional again in the guest or user-space application without requiring manual unbind/rebind.
+ 
+ [ Regression Potential ]
+ 
+ The fix affects how the s390 PCI error handler interprets missing callbacks and the PCI_ERS_RESULT_NONE return code.
+ A bug here could cause the recovery flow to proceed when it should have aborted, or to treat driver abstention as successful recovery even in faulty situations.
+ Users may see PCI devices reported as recovered but remaining non-functional, recovery loops that repeatedly attempt to re-enable or reset devices, or devices silently failing I/O without triggering the expected operator intervention.
+ 
+ ---
+ Description: s390/pci: Don't abort recovery for user-space drivers

- Symptom:
+ Symptom:
  When a PCI device under the control of a vfio-pci based user-space
  driver encounters a PCI error event the subsequent error recovery flow
  in the kernel is aborted because the vfio-pci driver only implements
  the error_detected PCI error handler callback. This leaves the PCI
  device in the error state requiring unbinding/re-binding of the driver
  to get it operational again instead of only having to re-init the
  user-space driver.

- Problem:
+ Problem:
  According to the kernel documentation implementing only the
  error_detected() callback from the error handling operations should be
  enough for minimal recovery support. Contrary to this s390 so far
  required also the reset_slot() and resume() callbacks to be
  implemented, otherwise recovery would be aborted.

- Solution:
+ Solution:
  Remove the requirement for the additional operations bringing s390 in
  line with AER and EEH error recovery flows.

- Reproduction:
+ Reproduction:
  The problem can be reproduced with any user-space PCI driver such as
  the NVMe user-space driver built into QEMU

- Required Fix / Upstream-ID:
+ Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2121150

Title:
  [UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  In Progress
Status in linux source package in Plucky:
  In Progress

Bug description:
  [ Impact ]

  s390/pci: Don't abort recovery for user-space drivers

  When a PCI device under the control of a vfio-pci based user-space
  driver encounters a PCI error event the subsequent error recovery flow
  in the kernel is aborted because the vfio-pci driver only implements
  the error_detected PCI error handler callback. This leaves the PCI
  device in the error state requiring unbinding/re-binding of the driver
  to get it operational again instead of only having to re-init the
  user-space driver.

  According to the kernel documentation implementing only the
  error_detected() callback from the error handling operations should be
  enough for minimal recovery support. Contrary to this s390 so far
  required also the reset_slot() and resume() callbacks to be
  implemented, otherwise recovery would be aborted.

  Remove the requirement for the additional operations bringing s390 in
  line with AER and EEH error recovery flows.

  [ Fix ]

  Backport the following commit from upstream:
  - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support

  [ Test Plan ]

  Bind a PCI device to vfio-pci.
  Start a user-space workload using the device.
  Use the s390 PCI error injection interface to trigger a recoverable PCI error.
  Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s error_detected() callback is invoked and recovery proceeds without abort.
  After recovery, check that the device is functional again in the guest or user-space application without requiring manual unbind/rebind.
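
As a rough illustration of the behaviour change the test plan is meant to confirm, the toy model below contrasts the two policies described under [ Impact ]. It is only a sketch in Python, not kernel code: the ErrHandlers class and both policy functions are hypothetical names made up for this example, and the callback the description calls reset_slot() is spelled slot_reset here, following the field name in the kernel's struct pci_error_handlers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrHandlers:
    """Hypothetical stand-in for the error callbacks a PCI driver provides."""
    error_detected: Optional[Callable[[], None]] = None
    slot_reset: Optional[Callable[[], None]] = None
    resume: Optional[Callable[[], None]] = None


def recovery_attempted_strict(handlers: ErrHandlers) -> bool:
    """Policy described as the old s390 behaviour: all three callbacks required."""
    required = (handlers.error_detected, handlers.slot_reset, handlers.resume)
    return all(cb is not None for cb in required)


def recovery_attempted_minimal(handlers: ErrHandlers) -> bool:
    """Policy described as the fix: error_detected() alone is sufficient."""
    return handlers.error_detected is not None


# A driver that, like vfio-pci in the description, only implements error_detected().
vfio_like = ErrHandlers(error_detected=lambda: None)

# A driver implementing all three callbacks (assumed here for comparison).
full_driver = ErrHandlers(
    error_detected=lambda: None,
    slot_reset=lambda: None,
    resume=lambda: None,
)

if __name__ == "__main__":
    for name, drv in (("vfio_like", vfio_like), ("full_driver", full_driver)):
        print(name,
              "strict:", recovery_attempted_strict(drv),
              "minimal:", recovery_attempted_minimal(drv))
```

In this model the vfio-like driver is rejected by the strict policy and accepted by the minimal one, while a driver with no error_detected() callback is rejected by both. The real recovery flow involves further steps and return codes (for example PCI_ERS_RESULT_NONE) that are deliberately left out.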

  [ Regression Potential ]

  The fix affects how the s390 PCI error handler interprets missing callbacks and the PCI_ERS_RESULT_NONE return code.
  A bug here could cause the recovery flow to proceed when it should have aborted, or to treat driver abstention as successful recovery even in faulty situations.
  Users may see PCI devices reported as recovered but remaining non-functional, recovery loops that repeatedly attempt to re-enable or reset devices, or devices silently failing I/O without triggering the expected operator intervention.

  ---
  Description: s390/pci: Don't abort recovery for user-space drivers

  Symptom:
  When a PCI device under the control of a vfio-pci based user-space
  driver encounters a PCI error event the subsequent error recovery flow
  in the kernel is aborted because the vfio-pci driver only implements
  the error_detected PCI error handler callback. This leaves the PCI
  device in the error state requiring unbinding/re-binding of the driver
  to get it operational again instead of only having to re-init the
  user-space driver.

  Problem:
  According to the kernel documentation implementing only the
  error_detected() callback from the error handling operations should be
  enough for minimal recovery support. Contrary to this s390 so far
  required also the reset_slot() and resume() callbacks to be
  implemented, otherwise recovery would be aborted.

  Solution:
  Remove the requirement for the additional operations bringing s390 in
  line with AER and EEH error recovery flows.

  Reproduction:
  The problem can be reproduced with any user-space PCI driver such as
  the NVMe user-space driver built into QEMU

  Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2121150/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

