** Description changed:

+ [ Impact ]
+ 
+ s390/pci: Don't abort recovery for user-space drivers
+ 
+ When a PCI device under the control of a vfio-pci based user-space
+ driver encounters a PCI error event, the subsequent error recovery flow
+ in the kernel is aborted because the vfio-pci driver only implements the
+ error_detected PCI error handler callback. This leaves the PCI device in
+ the error state, so the driver must be unbound and re-bound to make the
+ device operational again, instead of only having to re-init the
+ user-space driver.
+ 
+ According to the kernel documentation, implementing only the
+ error_detected() callback from the error handling operations should be
+ enough for minimal recovery support. Contrary to this, s390 so far also
+ required the slot_reset() and resume() callbacks to be implemented;
+ otherwise recovery would be aborted.
+ 
+ Remove the requirement for the additional operations, bringing s390 in
+ line with the AER and EEH error recovery flows.
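The change in the recovery precondition described above can be illustrated with a small Python model. This is only a sketch and not the kernel code: the `ErrHandlers` class and the `recovery_allowed` function are hypothetical names introduced here, and the model covers just the "which callbacks must exist" check, not the rest of the recovery flow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrHandlers:
    """Hypothetical stand-in for the error callbacks a PCI driver may provide."""
    error_detected: Optional[Callable] = None
    slot_reset: Optional[Callable] = None
    resume: Optional[Callable] = None


def recovery_allowed(handlers: Optional[ErrHandlers], strict: bool) -> bool:
    """Return True if automatic recovery may start for this driver.

    strict=True  models the old s390 rule: all three callbacks required.
    strict=False models the relaxed rule: error_detected alone is enough.
    """
    if handlers is None or handlers.error_detected is None:
        # Without error_detected the driver cannot take part in recovery at all.
        return False
    if strict:
        return handlers.slot_reset is not None and handlers.resume is not None
    return True


# A vfio-pci-like driver that implements only error_detected.
vfio_like = ErrHandlers(error_detected=lambda: None)

print(recovery_allowed(vfio_like, strict=True))   # old rule: recovery aborted
print(recovery_allowed(vfio_like, strict=False))  # relaxed rule: recovery proceeds
```

Under the strict rule the vfio-pci-like driver is rejected; under the relaxed rule it is accepted, which is the behavioural difference the fix is meant to introduce.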
+ 
+ [ Fix ]
+ 
+ Backport the following commit from upstream:
+ - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support
+ 
+ [ Test Plan ]
+ 
+ 1. Bind a PCI device to vfio-pci.
+ 2. Start a user-space workload using the device.
+ 3. Use the s390 PCI error injection interface to trigger a recoverable PCI error.
+ 4. Observe the kernel logs (dmesg) and confirm that the vfio-pci driver’s error_detected() callback is invoked and that recovery proceeds without being aborted.
+ 5. After recovery, check that the device is functional again in the guest or user-space application without requiring a manual unbind/rebind.
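The first step of the test plan (binding the device to vfio-pci) can be sketched with the generic PCI sysfs interface. The helper names `vfio_bind_plan` and `apply_plan` and the device address in the usage note are assumptions made for illustration. The sketch also assumes the vfio-pci module is already loaded and that the script runs as root. It only builds the list of sysfs writes unless `apply_plan` is called explicitly.

```python
from typing import List, Tuple

SYSFS_PCI = "/sys/bus/pci"


def vfio_bind_plan(bdf: str, currently_bound: bool = True) -> List[Tuple[str, str]]:
    """Return the (path, value) sysfs writes that hand device `bdf` to vfio-pci."""
    dev = f"{SYSFS_PCI}/devices/{bdf}"
    plan = []
    if currently_bound:
        # Detach the device from whatever driver currently owns it.
        plan.append((f"{dev}/driver/unbind", bdf))
    # Restrict the next probe of this device to vfio-pci.
    plan.append((f"{dev}/driver_override", "vfio-pci"))
    # Ask the PCI core to re-probe the device.
    plan.append((f"{SYSFS_PCI}/drivers_probe", bdf))
    return plan


def apply_plan(plan: List[Tuple[str, str]]) -> None:
    """Perform the writes; requires root and a real device."""
    for path, value in plan:
        with open(path, "w") as f:
            f.write(value)
```

For example, `vfio_bind_plan("0001:00:00.0")` (a placeholder address) returns three writes: unbind, set `driver_override`, then `drivers_probe`. Keeping the plan separate from its execution allows the steps to be reviewed before touching a live system.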
+ 
+ [ Regression Potential ]
+ 
+ The fix affects how the s390 PCI error handler interprets missing callbacks and the PCI_ERS_RESULT_NONE return code.
+ A bug here could cause the recovery flow to proceed when it should have aborted, or to treat driver abstention as successful recovery even in faulty situations.
+ Users may see PCI devices reported as recovered but remaining non-functional, recovery loops that repeatedly attempt to re-enable or reset devices, or devices silently failing I/O without triggering the expected operator intervention.
+ 
+ ---
+ 
  Description:   s390/pci: Don't abort recovery for user-space drivers
  
- Symptom:       
+ Symptom:
  When a PCI device under the control of a vfio-pci based user-space driver
encounters a PCI error event, the subsequent error recovery flow in the kernel
is aborted because the vfio-pci driver only implements the error_detected PCI
error handler callback. This leaves the PCI device in the error state, so the
driver must be unbound and re-bound to make the device operational again,
instead of only having to re-init the user-space driver.
  
- Problem:       
+ Problem:
  According to the kernel documentation, implementing only the error_detected()
callback from the error handling operations should be enough for minimal
recovery support. Contrary to this, s390 so far also required the slot_reset()
and resume() callbacks to be implemented; otherwise recovery would be aborted.
  
- Solution:      
+ Solution:
  Remove the requirement for the additional operations, bringing s390 in line
with the AER and EEH error recovery flows.
  
- Reproduction:  
+ Reproduction:
  The problem can be reproduced with any user-space PCI driver, such as the NVMe
user-space driver built into QEMU.
  
- Required Fix / Upstream-ID:   
+ Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2121150

Title:
  [UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers


