Private bug reported:
On-package interconnects (e.g., CPU die-to-die links, chiplet
interconnects such as xGMI, Infinity Fabric, UPI, or similar proprietary
fabrics) are critical for communication between components within a
single package. These links operate at very high speeds and low
latencies, making reliability essential for correct system operation.
On-package link parity is a lightweight error detection mechanism that
adds parity bits to data transfers across these internal links. It
detects single-bit errors (more generally, any odd number of flipped
bits within a protected unit) caused by signal integrity issues,
transient faults, or silicon variations. Upon detection, hardware
mechanisms may trigger retries, error logging, or escalation via
Machine Check Architecture (MCA).
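
To make the detection property concrete, here is a minimal Python sketch of even parity over a single payload word. It is illustrative only: real links compute parity in hardware over vendor-defined units, and the function names below are hypothetical, not part of any kernel or vendor interface.

```python
def parity_bit(word):
    """Return the even-parity bit: 1 if the payload has an odd number of set bits."""
    return bin(word).count("1") % 2


def check(word, parity):
    """Receiver-side check: recompute parity and compare with the transmitted bit."""
    return parity_bit(word) == parity


# Sender computes parity for an 8-bit payload (hypothetical example value).
payload = 0b10110010
p = parity_bit(payload)

# Simulate faults on the link by flipping bits in flight.
single_flip = payload ^ (1 << 3)   # one bit flipped
double_flip = payload ^ 0b11       # two bits flipped

print(check(payload, p))       # True: clean transfer passes
print(check(single_flip, p))   # False: single-bit error is detected
print(check(double_flip, p))   # True: even number of flips goes undetected
```

The last case shows why parity is described as lightweight: it catches odd-count bit errors but cannot detect an even number of flips, and it cannot correct anything by itself.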
Although these mechanisms are implemented in hardware, error events and
telemetry can be surfaced to the OS for diagnostics and RAS handling. In
the Linux kernel, such errors are typically reported via MCA/SMCA, ACPI
APEI, or vendor-specific drivers. However, visibility into on-package
link parity errors is often limited, inconsistent, or not standardized
across platforms.
Enhancing OS-level support for on-package link parity reporting and
handling would improve fault isolation, debugging, and overall platform
reliability, especially in chiplet-based architectures.
Feature Request:
Requested OS-level enablement:
- Enhance MCA/SMCA decoding to include on-package link parity error
  classification.
- Integrate parity error reporting with RAS and EDAC subsystems.
- Provide visibility into parity error counts, locations (link/segment),
  and severity.
- Enable sysfs/debugfs interfaces for monitoring on-package link health.
- Support proactive mitigation (e.g., link retraining, throttling, core
  offlining).
- Enable firmware-to-OS handoff of link parity error telemetry and
  thresholds.
- Correlate on-package link errors with CPU, memory, and I/O subsystem
  events.
- Improve logging and tracing for transient and persistent parity errors.
- Provide tools for debugging and validation of on-package interconnect
  reliability.
- Document error types, thresholds, and recommended mitigation strategies.
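
As a rough illustration of the threshold-based mitigation requested above, the following Python sketch counts parity errors per link within a sliding time window and suggests escalation once a threshold is reached. Everything here is an assumption made for illustration: the class name, the link identifier "die0-die1", the threshold, and the window length are hypothetical and do not correspond to any existing kernel, firmware, or vendor interface.

```python
from collections import defaultdict, deque


class LinkErrorTracker:
    """Hypothetical per-link parity error tracker with a sliding time window."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold      # errors within the window that trigger escalation
        self.window_s = window_s        # window length in seconds
        self._events = defaultdict(deque)

    def record(self, link_id, timestamp):
        """Record one parity error and return the suggested action for that link."""
        events = self._events[link_id]
        events.append(timestamp)
        # Discard errors that fell out of the sliding window.
        while events and timestamp - events[0] > self.window_s:
            events.popleft()
        return "escalate" if len(events) >= self.threshold else "log"


tracker = LinkErrorTracker(threshold=3, window_s=60.0)
actions = [tracker.record("die0-die1", t) for t in (0.0, 10.0, 20.0, 200.0)]
print(actions)  # ['log', 'log', 'escalate', 'log']
```

In this sketch "escalate" stands in for whatever mitigation the platform supports (link retraining, throttling, or core offlining), while isolated transient errors are only logged.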
Business Justification:
- Improves reliability of chiplet-based and multi-die processor
  architectures.
- Enables early detection of internal interconnect degradation or
  instability.
- Reduces system crashes and silent data corruption risks.
- Enhances observability and diagnostics for platform validation teams.
- Supports mission-critical workloads requiring high availability.
- Aligns OS capabilities with modern on-package interconnect RAS features.
References:
- CPU Vendor Documentation (e.g., AMD Infinity Fabric / xGMI, Intel UPI
  RAS guides)
- Linux Kernel RAS, MCA/SMCA, and EDAC Subsystem Documentation
- ACPI Platform Error Interface (APEI) Specification
- Industry Whitepapers on Chiplet Architectures and On-Package
  Interconnect Reliability
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
https://bugs.launchpad.net/bugs/2146667
Title:
Request for RAS Reliability Support – On-Package Link Parity