This bug is awaiting verification that the linux-azure-fde-6.8/6.8.0-1041.48~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-azure-fde-6.8' to 'verification-done-jammy-linux-azure-fde-6.8'. If the problem still exists, change the tag 'verification-needed-jammy-linux-azure-fde-6.8' to 'verification-failed-jammy-linux-azure-fde-6.8'.
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-azure-fde-6.8-v2 verification-needed-jammy-linux-azure-fde-6.8

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2121149

Title:
  [UBUNTU 24.04] s390/pci: Fix stale function handles in error handling

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Plucky:
  Fix Committed

Bug description:

[ Impact ]

s390/pci: Fix stale function handles in error handling

In some error scenarios, multiple error events may be generated for the same PCI function before Linux has even started its automatic recovery process. In this case Linux may recover successfully based on the first event but then fail recovery when handling a subsequent event. This is because events capture the function handle as they are created. By the time the subsequent event is handled, the handle stored with the error event is stale, and using it to reset the function will fail.

Fix this by retrieving a fresh function handle using the CLP List PCI Functions and only processing events where the stored handle matches this handle. This effectively ignores error events which were captured before the latest disable/enable cycle.

Relatedly, if the current handle is already disabled, do not attempt to simply reset the error state, as a re-enable is necessary and clearing the error state would fail.
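The decision described above can be illustrated with a small user-space model. This is only a sketch of the logic, not the kernel code from the upstream commits: `FunctionHandle`, `ErrorEvent`, `Action`, and `decide` are hypothetical names, and the dictionary `current_handles` merely stands in for the result of a CLP List PCI Functions query.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE_STALE = "ignore: event predates the latest disable/enable cycle"
    ESCALATE = "escalate: function is disabled, a re-enable is necessary"
    RESET_ERROR_STATE = "reset the error state"


@dataclass(frozen=True)
class FunctionHandle:
    # Hypothetical model of a function handle; the real handle layout
    # is not modelled here.
    value: int
    enabled: bool


@dataclass(frozen=True)
class ErrorEvent:
    fid: int                # PCI function the event refers to
    handle: FunctionHandle  # handle captured when the event was created


def decide(event, current_handles):
    """Pick a recovery action for one error event.

    current_handles maps function IDs to freshly retrieved handles
    (an assumed stand-in for CLP List PCI Functions).
    """
    current = current_handles[event.fid]
    if event.handle != current:
        # The stored handle no longer matches: the event is stale.
        return Action.IGNORE_STALE
    if not current.enabled:
        # Clearing the error state would fail on a disabled function.
        return Action.ESCALATE
    return Action.RESET_ERROR_STATE
```

In this model, two events created with the same handle before recovery starts behave as the description says: once recovery for the first event has cycled the function and produced a new handle, `decide` returns `Action.IGNORE_STALE` for the second event instead of attempting a reset with the old handle.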
[ Fix ]

Backport the following commits from upstream:
- 45537926dd2a s390/pci: Fix stale function handles in error handling
- b97a7972b1f4 s390/pci: Do not try re-enabling load/store if device is disabled

[ Test Plan ]

Boot the system on an IBM Z mainframe with at least one PCI passthrough device available. Enable debug logging in order to monitor how error events are processed in real time.

Trigger PCI error conditions, either through firmware error injection or by repeatedly disabling and re-enabling the device under load using sysfs interfaces. While the device is busy handling real traffic, such as network or crypto operations, watch the kernel logs to see how error events are processed.

Verify that events carrying stale function handles are detected and ignored, and that recovery attempts against disabled devices escalate properly to a full reset.

[ Regression Potential ]

The fix affects how the s390 PCI error handler validates and uses function handles during recovery. A bug here could cause valid error events to be incorrectly ignored or recovery paths to escalate unnecessarily. Users may see PCI devices not recovering from transient errors, devices being reset or re-enabled more often than required, or even unexpected device removal.

---

Description: s390/pci: Fix stale function handles in error handling

Symptom: In some error scenarios, automatic recovery may ultimately fail: Linux initially recovers successfully but then fails when it tries to handle another error event.

Problem: In some error scenarios, multiple error events may be generated for the same PCI function before Linux has even started its automatic recovery process. In this case Linux may recover successfully based on the first event but then fail recovery when handling a subsequent event. This is because events capture the function handle as they are created. By the time the subsequent event is handled, the handle stored with the error event is stale, and using it to reset the function will fail.

Solution: Fix this by retrieving a fresh function handle using the CLP List PCI Functions and only processing events where the stored handle matches this handle. This effectively ignores error events which were captured before the latest disable/enable cycle. Relatedly, if the current handle is already disabled, do not attempt to simply reset the error state, as a re-enable is necessary and clearing the error state would fail.

Reproduction: This may be reproduced in an artificial error scenario by issuing multiple zpcictl --reset-fw <dev> in quick succession, generating multiple PEC 0x3A events with the same handle.

Required Fixes / Upstream-IDs:
45537926dd2aaa9190ac0fac5a0fbeefcadfea95
b97a7972b1f4f81417840b9a2ab0c19722b577d5
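The reproduction described in this report (several zpcictl --reset-fw <dev> calls in quick succession) could be scripted roughly as follows. This is a sketch under assumptions: the helper names `reset_commands` and `fire_resets` and the default count of 3 are made up for illustration, the device argument is whatever zpcictl accepts on the test system, and only the `--reset-fw` option quoted in the report is used.

```python
import subprocess


def reset_commands(device, count):
    """Build `count` identical zpcictl --reset-fw invocations for `device`."""
    return [["zpcictl", "--reset-fw", device] for _ in range(count)]


def fire_resets(device, count=3):
    """Run the resets back to back and return each exit status.

    check=False keeps going after a failed call, because a failing
    later reset is exactly the situation under investigation.
    """
    return [
        subprocess.run(cmd, check=False).returncode
        for cmd in reset_commands(device, count)
    ]
```

Calling `fire_resets` with the device under test, and then checking the kernel log, is one way to see whether the later events are ignored as stale on a fixed kernel. The script likely needs sufficient privileges to run zpcictl.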

