Private bug reported:

Core error containment is a critical RAS (Reliability, Availability,
Serviceability) capability that ensures faults occurring within a CPU
core (e.g., execution units, pipelines, registers, or private caches)
are detected, isolated, and contained without propagating to other cores
or system components. As core counts increase and workloads become more
distributed, isolating faults at the core level is essential to maintain
system stability and prevent widespread failures.

Modern processors implement hardware mechanisms such as Machine Check
Architecture (MCA/SMCA), core-level parity/ECC protection, fault
detection, and recovery flows (e.g., instruction replay, pipeline flush,
core fencing). When unrecoverable errors occur, the system may offline
the affected core while allowing the rest of the system to continue
operation.
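
For illustration, the offlining step can already be approximated from
user space through the CPU hotplug sysfs files. The sketch below is not
a proposed implementation: it assumes the standard
/sys/devices/system/cpu/cpuN/online and
topology/thread_siblings_list files, requires root, and the helper
names (parse_cpu_list, sibling_threads, offline_core) are hypothetical.
It takes all SMT siblings of a faulty logical CPU offline so that the
whole physical core stops receiving work.

```python
from pathlib import Path

# Standard location of the per-CPU sysfs directories on Linux.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Parse a kernel CPU list string such as '0-3,8' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_threads(cpu, root=SYSFS_CPU_ROOT):
    """Return the logical CPUs that share a physical core with `cpu`."""
    path = Path(root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


def offline_core(cpu, root=SYSFS_CPU_ROOT):
    """Write '0' to the `online` file of every sibling thread of `cpu`.

    Returns the list of logical CPUs that were taken offline. CPUs with
    no `online` file (commonly cpu0 on x86) are skipped.
    """
    offlined = []
    # Read the sibling list once, before any CPU is taken offline.
    for sibling in sibling_threads(cpu, root):
        online = Path(root) / f"cpu{sibling}" / "online"
        if not online.exists():
            continue
        online.write_text("0")
        offlined.append(sibling)
    return offlined
```

The `root` parameter exists only so the helpers can be exercised
against a scratch directory; on a live system, offline_core(5) would
act on cpu5 and its sibling threads.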

In the Linux kernel, support exists for handling CPU errors via MCA,
ACPI APEI, and RAS frameworks. However, enhancements are needed to
improve granularity, automation, and observability of core-level fault
containment and recovery mechanisms, especially for next-generation
high-core-count systems.

Feature Request:
Capabilities requested in the OS:
  Enhance MCA/SMCA decoding for detailed core-level error classification.
  Enable precise identification of faulty cores (core/thread granularity).
  Integrate core error events with RAS and system logging frameworks.
  Support automatic core offlining for unrecoverable errors.
  Enable recovery mechanisms such as task migration and workload
    redistribution.
  Provide sysfs/debugfs interfaces for monitoring core health and error
    statistics.
  Support firmware-to-OS handoff of core error containment capabilities
    and thresholds.
  Improve handling of corrected vs uncorrected core errors.
  Enable correlation of core errors with cache, memory, and interconnect
    events.
  Provide tools for diagnostics, validation, and failure analysis.
  Document core error handling policies and recommended mitigation
    strategies.
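
As an illustration of the corrected vs uncorrected distinction
requested above, the following sketch classifies a raw MCi_STATUS
value using only the architectural bits common to x86 MCA banks
(VAL = bit 63, UC = bit 61, PCC = bit 57). The constant names mirror
those in the kernel's asm/mce.h; the function name and the returned
labels are hypothetical, and vendor-specific bits (for example
deferred or action-required indications) are deliberately left out.

```python
# Architectural MCi_STATUS bits shared by x86 MCA implementations.
MCI_STATUS_VAL = 1 << 63  # register contains a valid error
MCI_STATUS_UC = 1 << 61   # error was not corrected by hardware
MCI_STATUS_PCC = 1 << 57  # processor context may be corrupt


def classify_mca_status(status):
    """Map a raw 64-bit MCi_STATUS value to a coarse severity label."""
    if not status & MCI_STATUS_VAL:
        return "invalid"
    if status & MCI_STATUS_PCC:
        # Context corruption means the core cannot safely continue.
        return "uncorrected-context-corrupt"
    if status & MCI_STATUS_UC:
        return "uncorrected"
    return "corrected"
```

A policy layer could, for example, count "corrected" events per core
against a threshold and treat "uncorrected-context-corrupt" as an
immediate trigger for containment; that mapping is an assumption of
this sketch, not existing kernel behaviour.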

Business Justification:
  Improves system availability by isolating faults to individual cores. 
  Reduces impact of hardware failures on running workloads. 
  Enables graceful degradation instead of full system crashes. 
  Supports high core-count CPUs used in enterprise and hyperscale systems. 
  Enhances observability and root-cause analysis for CPU-related failures. 
  Aligns OS capabilities with advanced CPU RAS features.

References:
  CPU Vendor Documentation (e.g., AMD SMCA, Intel MCA guides) 
  Linux Kernel RAS and MCA/APEI Documentation 
  ACPI Platform Error Interface (APEI) Specification 
  Industry Whitepapers on CPU Reliability and Fault Containment

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146671

Title:
  Request for RAS Reliability Support – Core Error Containment

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146671/+subscriptions

