** Description changed:

+ [ Impact ]
+ 
+ s390/pci: Don't abort recovery for user-space drivers
+ 
+ When a PCI device under the control of a vfio-pci based user-space
+ driver encounters a PCI error event, the subsequent error recovery flow
+ in the kernel is aborted because the vfio-pci driver only implements the
+ error_detected() PCI error handler callback. This leaves the PCI device
+ in the error state, requiring the driver to be unbound and re-bound to
+ get it operational again instead of only having to re-init the
+ user-space driver.
+ 
+ According to the kernel documentation, implementing only the
+ error_detected() callback from the error handling operations should be
+ enough for minimal recovery support. Contrary to this, s390 has so far
+ also required the slot_reset() and resume() callbacks to be implemented;
+ otherwise recovery would be aborted.
+ 
+ Remove the requirement for the additional operations, bringing s390 in
+ line with AER and EEH error recovery flows.
+ 
+ [ Fix ]
+ 
+ Backport the following commit from upstream:
+ - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support
+ 
+ [ Test Plan ]
+ 
+ Bind a PCI device to vfio-pci.
+ Start a user-space workload using the device.
+ Use the s390 PCI error injection interface to trigger a recoverable PCI error.
+ Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s
+ error_detected() callback is invoked and recovery proceeds without abort.
+ After recovery, check that the device is functional again in the guest or
+ user-space application without requiring manual unbind/rebind.
+ 
+ [ Regression Potential ]
+ 
+ The fix affects how the s390 PCI error handler interprets missing callbacks
+ and the PCI_ERS_RESULT_NONE return code.
+ A bug here could cause the recovery flow to proceed when it should have
+ aborted, or to treat driver abstention as successful recovery even in faulty
+ situations.
+ Users may see PCI devices reported as recovered but remaining non-functional,
+ recovery loops that repeatedly attempt to re-enable or reset devices, or
+ devices silently failing I/O without triggering the expected operator
+ intervention.
+ 
+ ---
+ 
  Description:   s390/pci: Don't abort recovery for user-space drivers
  
- Symptom:       
+ Symptom:
  When a PCI device under the control of a vfio-pci based user-space driver
  encounters a PCI error event, the subsequent error recovery flow in the kernel
  is aborted because the vfio-pci driver only implements the error_detected()
  PCI error handler callback. This leaves the PCI device in the error state,
  requiring the driver to be unbound and re-bound to get it operational again
  instead of only having to re-init the user-space driver.
  
- Problem:       
+ Problem:
  According to the kernel documentation, implementing only the error_detected()
  callback from the error handling operations should be enough for minimal
  recovery support. Contrary to this, s390 has so far also required the
  slot_reset() and resume() callbacks to be implemented; otherwise recovery
  would be aborted.
  
- Solution:      
+ Solution:
  Remove the requirement for the additional operations, bringing s390 in line
  with AER and EEH error recovery flows.
  
- Reproduction:  
+ Reproduction:
  The problem can be reproduced with any user-space PCI driver, such as the NVMe
  user-space driver built into QEMU.
  
- Required Fix / Upstream-ID:   
+ Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2121150

Title:
  [UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  In Progress
Status in linux source package in Plucky:
  In Progress

Bug description:
  [ Impact ]

  s390/pci: Don't abort recovery for user-space drivers

  When a PCI device under the control of a vfio-pci based user-space
  driver encounters a PCI error event, the subsequent error recovery flow
  in the kernel is aborted because the vfio-pci driver only implements
  the error_detected() PCI error handler callback. This leaves the PCI
  device in the error state, requiring the driver to be unbound and
  re-bound to get it operational again instead of only having to re-init
  the user-space driver.

  According to the kernel documentation, implementing only the
  error_detected() callback from the error handling operations should be
  enough for minimal recovery support. Contrary to this, s390 has so far
  also required the slot_reset() and resume() callbacks to be
  implemented; otherwise recovery would be aborted.

  Remove the requirement for the additional operations, bringing s390 in
  line with AER and EEH error recovery flows.
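
To make the behaviour change concrete, here is a toy Python model of the decision described above. It is not the kernel code: the ErrorHandlers class, can_attempt_recovery() and the strict flag are names invented for this illustration. Only the callback names error_detected(), slot_reset() and resume() come from the kernel's PCI error handler interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrorHandlers:
    """Hypothetical stand-in for a driver's optional PCI error callbacks."""
    error_detected: Optional[Callable[[], None]] = None
    slot_reset: Optional[Callable[[], None]] = None
    resume: Optional[Callable[[], None]] = None


def can_attempt_recovery(handlers: Optional[ErrorHandlers], strict: bool) -> bool:
    """Decide whether automatic recovery is attempted at all.

    strict=True models the old s390 behaviour described in this bug,
    strict=False models the relaxed behaviour after the fix.
    """
    # A driver without error_detected() offers no recovery support.
    if handlers is None or handlers.error_detected is None:
        return False
    if strict:
        # Old behaviour: slot_reset() and resume() were also mandatory.
        return handlers.slot_reset is not None and handlers.resume is not None
    # New behaviour: error_detected() alone is sufficient.
    return True


def _noop() -> None:
    """Placeholder callback body; the model only checks for presence."""


# vfio-pci style driver: only error_detected() is implemented.
minimal_driver = ErrorHandlers(error_detected=_noop)
# Driver implementing all three callbacks.
full_driver = ErrorHandlers(error_detected=_noop, slot_reset=_noop, resume=_noop)

for name, drv in (("minimal", minimal_driver), ("full", full_driver)):
    print(name,
          "old:", can_attempt_recovery(drv, strict=True),
          "new:", can_attempt_recovery(drv, strict=False))
```

In this model the minimal driver is rejected under the old policy and accepted under the new one, while a driver with all three callbacks is accepted either way.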

  [ Fix ]

  Backport the following commit from upstream:
  - 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support

  [ Test Plan ]

  Bind a PCI device to vfio-pci.
  Start a user-space workload using the device.
  Use the s390 PCI error injection interface to trigger a recoverable PCI error.
  Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s
  error_detected() callback is invoked and recovery proceeds without abort.
  After recovery, check that the device is functional again in the guest or
  user-space application without requiring manual unbind/rebind.
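
The first step of the test plan can be sketched as follows, assuming the usual sysfs driver_override mechanism and an already loaded vfio-pci module. The helper name bind_to_vfio_pci() and its sysfs_root parameter are made up for this example; the parameter exists only so the function can be exercised against a scratch directory instead of the real /sys. The error injection step is not sketched because the bug report does not name the interface.

```python
from pathlib import Path


def bind_to_vfio_pci(address: str, sysfs_root: Path = Path("/sys")) -> None:
    """Hand the PCI device at `address` over to vfio-pci via sysfs.

    Hypothetical helper: needs root privileges and a loaded vfio-pci
    module when run against the real /sys.
    """
    dev = sysfs_root / "bus" / "pci" / "devices" / address

    # Tell the PCI core that only vfio-pci may claim this device.
    (dev / "driver_override").write_text("vfio-pci\n")

    # Detach the currently bound driver, if there is one.
    unbind = dev / "driver" / "unbind"
    if unbind.exists():
        unbind.write_text(address + "\n")

    # Ask the PCI core to probe the device again; with the override in
    # place, vfio-pci is the driver that picks it up.
    (sysfs_root / "bus" / "pci" / "drivers_probe").write_text(address + "\n")
```

A call would look like bind_to_vfio_pci("0000:00:00.0"), where the address is a placeholder for the device under test.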

  [ Regression Potential ]

  The fix affects how the s390 PCI error handler interprets missing callbacks
  and the PCI_ERS_RESULT_NONE return code.
  A bug here could cause the recovery flow to proceed when it should have
  aborted, or to treat driver abstention as successful recovery even in faulty
  situations.
  Users may see PCI devices reported as recovered but remaining non-functional,
  recovery loops that repeatedly attempt to re-enable or reset devices, or
  devices silently failing I/O without triggering the expected operator
  intervention.

  ---

  Description:   s390/pci: Don't abort recovery for user-space drivers

  Symptom:
  When a PCI device under the control of a vfio-pci based user-space driver
  encounters a PCI error event, the subsequent error recovery flow in the kernel
  is aborted because the vfio-pci driver only implements the error_detected()
  PCI error handler callback. This leaves the PCI device in the error state,
  requiring the driver to be unbound and re-bound to get it operational again
  instead of only having to re-init the user-space driver.

  Problem:
  According to the kernel documentation, implementing only the error_detected()
  callback from the error handling operations should be enough for minimal
  recovery support. Contrary to this, s390 has so far also required the
  slot_reset() and resume() callbacks to be implemented; otherwise recovery
  would be aborted.

  Solution:
  Remove the requirement for the additional operations, bringing s390 in line
  with AER and EEH error recovery flows.

  Reproduction:
  The problem can be reproduced with any user-space PCI driver, such as the NVMe
  user-space driver built into QEMU.

  Required Fix / Upstream-ID:
  62355f1f87b8c7f8785a8dd3cd5ca6e5b513566a

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/2121150/+subscriptions
