** Changed in: ubuntu-z-systems
Status: In Progress => Fix Committed
--
https://bugs.launchpad.net/bugs/2121150
Title:
[UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers
Status in Ubuntu on IBM z Systems:
Fix Committed
Status in linux package in Ubuntu:
Invalid
Status in linux source package in Noble:
Fix Committed
Status in linux source package in Plucky:
Fix Committed
Bug description:
[ Impact ]
s390/pci: Don't abort recovery for user-space drivers
When a PCI device under the control of a vfio-pci based user-space
driver encounters a PCI error event, the subsequent error recovery flow
in the kernel is aborted because the vfio-pci driver implements only
the error_detected() PCI error handler callback. This leaves the PCI
device in the error state, so the driver has to be unbound and re-bound
to get the device operational again, when re-initializing the
user-space driver should have been sufficient.
According to the kernel documentation, implementing only the
error_detected() callback from the error handling operations should be
enough for minimal recovery support. Contrary to this, s390 has so far
also required the slot_reset() and resume() callbacks to be
implemented; otherwise recovery was aborted.
Remove the requirement for the additional operations, bringing s390 in
line with the AER and EEH error recovery flows.
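
The change in requirements can be illustrated with a toy model. This is
only a sketch, not kernel code: the function and constant names are
made up for illustration, and the callback names follow the kernel's
struct pci_error_handlers, where the slot reset callback is spelled
slot_reset. It assumes, as described above, that vfio-pci implements
only error_detected.

```python
# Toy model of the callback requirement described above. Not kernel code;
# all names here are hypothetical and exist only for illustration.

# Callbacks a driver had to implement for s390 recovery to proceed.
REQUIRED_BEFORE_FIX = frozenset({"error_detected", "slot_reset", "resume"})
REQUIRED_AFTER_FIX = frozenset({"error_detected"})

# Per the bug description, vfio-pci implements only error_detected.
VFIO_PCI_CALLBACKS = frozenset({"error_detected"})


def recovery_proceeds(driver_callbacks, required):
    """Return True if the driver implements every required callback."""
    return required.issubset(driver_callbacks)


if __name__ == "__main__":
    print("before fix:", recovery_proceeds(VFIO_PCI_CALLBACKS, REQUIRED_BEFORE_FIX))
    print("after fix: ", recovery_proceeds(VFIO_PCI_CALLBACKS, REQUIRED_AFTER_FIX))
```

In this model a driver that implements all three callbacks proceeds in
both cases, so only minimal drivers such as vfio-pci see a difference.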
[ Fix ]
Backport the following commit from upstream:
- 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support
[ Test Plan ]
Bind a PCI device to vfio-pci.
Start a user-space workload using the device.
Use the s390 PCI error injection interface to trigger a recoverable PCI error.
Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s
error_detected() callback is invoked and recovery proceeds without abort.
After recovery, check that the device is functional again in the guest or
user-space application without requiring manual unbind/rebind.
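
The bind step and the final check could be scripted roughly as follows.
This is a minimal sketch under assumptions: it uses the generic PCI
sysfs files driver_override and drivers_probe, must run as root with
the vfio-pci module loaded, and the helper names and the device address
in the usage note are hypothetical. Error injection is not covered
here.

```python
import os


def plan_vfio_bind(bdf, sysfs_root="/sys"):
    """Return the (path, value) sysfs writes that hand `bdf` to vfio-pci."""
    dev = os.path.join(sysfs_root, "bus/pci/devices", bdf)
    return [
        # Detach the current driver, if any.
        (os.path.join(dev, "driver/unbind"), bdf),
        # Make the next probe pick vfio-pci for this device.
        (os.path.join(dev, "driver_override"), "vfio-pci"),
        # Ask the PCI core to probe the device again.
        (os.path.join(sysfs_root, "bus/pci/drivers_probe"), bdf),
    ]


def apply_writes(writes):
    """Perform the planned writes, skipping files that do not exist."""
    for path, value in writes:
        if not os.path.exists(path):
            # For example, driver/unbind is absent when no driver is bound.
            continue
        with open(path, "w") as f:
            f.write(value)


def bound_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to `bdf`, or None if unbound."""
    link = os.path.join(sysfs_root, "bus/pci/devices", bdf, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))
```

For a device at the placeholder address 0000:00:00.0, calling
apply_writes(plan_vfio_bind("0000:00:00.0")) performs the bind, and
bound_driver("0000:00:00.0") should still report vfio-pci after
recovery if no manual unbind/rebind was needed.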
[ Regression Potential ]
The fix affects how the s390 PCI error handler interprets missing callbacks
and the PCI_ERS_RESULT_NONE return code.
A bug here could cause the recovery flow to proceed when it should have
aborted, or to treat driver abstention as successful recovery even in faulty
situations.
Users may see PCI devices reported as recovered but remaining non-functional,
recovery loops that repeatedly attempt to re-enable or reset devices, or
devices silently failing I/O without triggering the expected operator
intervention.
---
Description: s390/pci: Don't abort recovery for user-space drivers
Symptom:
When a PCI device under the control of a vfio-pci based user-space driver
encounters a PCI error event, the subsequent error recovery flow in the kernel
is aborted because the vfio-pci driver implements only the error_detected()
PCI error handler callback. This leaves the PCI device in the error state, so
the driver has to be unbound and re-bound to get the device operational again,
when re-initializing the user-space driver should have been sufficient.
Problem:
According to the kernel documentation, implementing only the error_detected()
callback from the error handling operations should be enough for minimal
recovery support. Contrary to this, s390 has so far also required the
slot_reset() and resume() callbacks to be implemented; otherwise recovery was
aborted.
Solution:
Remove the requirement for the additional operations, bringing s390 in line
with the AER and EEH error recovery flows.
Reproduction:
The problem can be reproduced with any user-space PCI driver, such as the NVMe
user-space driver built into QEMU.
Required Fix / Upstream-ID:
62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a