Hi Jonathan,

On 11/18/25 8:47 PM, Jonathan Cameron wrote:
> On Thu, 13 Nov 2025 03:25:27 +1000
> Gavin Shan <[email protected]> wrote:
>
>> In the combination of 64KiB host and 4KiB guest, a problematic host
>> page affects 16x guest pages. Those 16x guest pages are most likely
>> owned by separate threads and accessed by the threads in parallel. It
>> means 16x memory errors can be raised at once. However, we're unable
>> to handle this situation because the only error source has one read
>> acknowledgement register in current design. QEMU has to crash in the
>> following path due to the previously delivered error isn't
>> acknowledged by the guest on attempt to deliver another error.
>>
>>   kvm_vcpu_thread_fn
>>   kvm_cpu_exec
>>   kvm_arch_on_sigbus_vcpu
>>   kvm_cpu_synchronize_state
>>   acpi_ghes_memory_errors
>>   abort
>>
>> This series fixes the issue by sending 16x consective CPER errors
>> which are contained in a single GHES error block.
>>
>>   PATCH[1-4] Increases GHES raw data maximal length from 1KiB to 4KiB
>>   PATCH[5]   Supports multiple error records in a single error block
>>   PATCH[6-7] Improves the error handling in the error delivery path
>>   PATCH[8]   Sends 16x consective CPERs in a single block if needed
>
> Hi Gavin,
>
> Just a quick head's up to say we've had some internal discussions
> around the kernel handling of broader address masks in CPER and think
> it is probably broken. Rectifying that may at least simplify what is
> needed on the QEMU side of things and maybe even handle much larger
> blocks (2M and larger).
>
> Will keep everyone informed of how we get on with resolving that.

Thanks, Jonathan. If the broader address mask in CPER can be used to
isolate the specified memory range instead of a page, QEMU won't need
the improvement done in this series. Please copy me, if possible, when
the Linux patches are sent for review; I will try to review them.

I will pull those patches improving error handling and post them
separately so that they can be merged. Those patches aren't really
relevant to error handling.

Thanks,
Gavin
