Private bug reported:

On-package interconnects (e.g., CPU die-to-die links, chiplet
interconnects such as xGMI, Infinity Fabric, UPI, or similar proprietary
fabrics) are critical for communication between components within a
single package. These links operate at very high speeds and low
latencies, making reliability essential for correct system operation.

On-package link parity is a lightweight error detection mechanism that
adds parity bits to data transfers across these internal links. It
enables detection of single-bit errors occurring due to signal integrity
issues, transient faults, or silicon variations. Upon detection,
hardware mechanisms may trigger retries, error logging, or escalation
via Machine Check Architecture (MCA).
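
As an illustration of the detection principle only, the sketch below models a single even-parity bit over one data word in software. Real links compute parity in hardware over vendor-defined flit or lane widths. The 8-bit payload and the function names here are assumptions made for the example, not any vendor's scheme.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def check(word: int, parity: int) -> bool:
    """Return True if the received word still matches the parity bit sent with it."""
    return parity_bit(word) == parity


# Hypothetical 8-bit payload; the sender computes parity before transmission.
payload = 0b1011_0010
sent_parity = parity_bit(payload)

single_flip = payload ^ (1 << 3)   # one bit corrupted in flight
double_flip = payload ^ 0b11       # two bits corrupted in flight

print(check(payload, sent_parity))      # intact transfer passes the check
print(check(single_flip, sent_parity))  # single-bit error is detected
print(check(double_flip, sent_parity))  # two-bit error goes undetected
```

A single parity bit catches any odd number of flipped bits but misses an even number, which is why it counts as lightweight detection and is usually paired with retry or stronger codes.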

Although these mechanisms are implemented in hardware, error events and
telemetry can be surfaced to the OS for diagnostics and RAS handling. In
the Linux kernel, such errors are typically reported via MCA/SMCA, ACPI
APEI, or vendor-specific drivers. However, visibility into on-package
link parity errors is often limited, inconsistent, or not standardized
across platforms.
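
For context, a machine-check bank status value can be split into its architectural flags as sketched below. The flag positions (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 to 57) follow the x86 MCi_STATUS layout. The split of the low 32 bits into two 16-bit codes follows Intel's description and is an assumption here, since AMD SMCA divides those bits differently. The sample value and the function name are made up for illustration and do not represent a real on-package link error signature.

```python
# Architectural flag bits of an x86 MCi_STATUS register.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status value into flags and error-code fields."""
    fields = {name: bool((status >> bit) & 1)
              for name, bit in MCI_STATUS_FLAGS.items()}
    # Assumption: Intel-style split of the low 32 bits.
    fields["mca_error_code"] = status & 0xFFFF
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    return fields


# Hypothetical sample: valid, enabled, corrected (UC clear), arbitrary codes.
sample = (1 << 63) | (1 << 60) | (0x0012 << 16) | 0x0E0B
for name, value in decode_mci_status(sample).items():
    print(f"{name}: {value}")
```

Which error codes identify a link parity event is vendor-specific and not defined by this sketch; that mapping is the classification work the request asks for.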

Enhancing OS-level support for on-package link parity reporting and
handling would improve fault isolation, debugging, and overall platform
reliability, especially in chiplet-based architectures.

Feature Request:
Requested OS-level capabilities:
  Enhance MCA/SMCA decoding to include on-package link parity error
    classification.
  Integrate parity error reporting with the RAS and EDAC subsystems.
  Provide visibility into parity error counts, locations (link/segment),
    and severity.
  Enable sysfs/debugfs interfaces for monitoring on-package link health.
  Support proactive mitigation (e.g., link retraining, throttling, core
    offlining).
  Enable firmware-to-OS handoff of link parity error telemetry and
    thresholds.
  Correlate on-package link errors with CPU, memory, and I/O subsystem
    events.
  Improve logging and tracing for transient and persistent parity errors.
  Provide tools for debugging and validating on-package interconnect
    reliability.
  Document error types, thresholds, and recommended mitigation strategies.
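
To make the requested per-link counts and thresholds more concrete, here is a minimal user-space sketch of how corrected parity events might be counted per link and classified as transient or persistent. The class name, the link label "die0-die1", the threshold of 3 events, and the 60-second window are all hypothetical. No existing kernel interface is implied.

```python
from collections import defaultdict, deque


class LinkErrorTracker:
    """Count parity events per link and flag links that exceed a rate threshold."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0):
        self.threshold = threshold         # events within the window that count as persistent
        self.window_s = window_s           # sliding window length in seconds
        self.total = defaultdict(int)      # lifetime event count per link
        self.recent = defaultdict(deque)   # timestamps still inside the window

    def record(self, link: str, timestamp: float) -> str:
        """Record one event and return 'transient' or 'persistent' for that link."""
        self.total[link] += 1
        window = self.recent[link]
        window.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while window and timestamp - window[0] > self.window_s:
            window.popleft()
        return "persistent" if len(window) >= self.threshold else "transient"


tracker = LinkErrorTracker(threshold=3, window_s=60.0)
for t in (0.0, 10.0, 20.0, 200.0):
    print(t, tracker.record("die0-die1", t))
print("total:", tracker.total["die0-die1"])
```

A "persistent" result could be the trigger for the proactive mitigations listed above, while the lifetime totals correspond to the counters a sysfs or debugfs interface might expose.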

Business Justification:
  Improves reliability of chiplet-based and multi-die processor architectures. 
  Enables early detection of internal interconnect degradation or instability. 
  Reduces system crashes and silent data corruption risks. 
  Enhances observability and diagnostics for platform validation teams. 
  Supports mission-critical workloads requiring high availability. 
  Aligns OS capabilities with modern on-package interconnect RAS features.

References:
  CPU vendor documentation (e.g., AMD Infinity Fabric / xGMI, Intel UPI
    RAS guides)
  Linux kernel RAS, MCA/SMCA, and EDAC subsystem documentation
  ACPI Platform Error Interface (APEI) specification
  Industry whitepapers on chiplet architectures and on-package
    interconnect reliability

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146667

Title:
  Request for RAS Reliability Support – On-Package Link Parity

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146667/+subscriptions


-- 
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
