Private bug reported:

Off-package interconnects (e.g., PCIe, CXL, and other high-speed SerDes-based 
links) are increasingly critical in modern platforms, enabling communication 
between CPUs, accelerators, memory expanders, and I/O devices. With higher data 
rates (e.g., PCIe Gen5/Gen6), signal integrity challenges increase, leading to 
higher bit error rates.
To ensure reliable communication, these links implement multiple layers of 
error detection and correction:
  FEC (Forward Error Correction): corrects bit errors at the physical layer 
without retransmission. 
  CRC (Cyclic Redundancy Check): detects data corruption at the data link layer. 
  Replay mechanism: retransmits packets when the CRC check detects errors 
that were not corrected. 
These mechanisms work together to provide robust data integrity and minimize 
data loss across off-package links. While largely handled in hardware, they 
generate error events and telemetry that are essential for system-level RAS, 
diagnostics, and performance tuning.
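
As a rough illustration of how CRC detection and replay interact, the Python
sketch below frames each payload with a sequence number and a CRC, then
retransmits any frame that fails the check. It is a toy model only:
zlib.crc32 stands in for the link CRC, and frame(), check(),
send_with_replay() and the corrupt_first_try parameter are hypothetical
names, not part of any PCIe/CXL specification or kernel interface. FEC is
not modelled.

```python
import zlib


def frame(seq, payload):
    """Build a frame: 2-byte sequence number, payload, 4-byte CRC."""
    body = seq.to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def check(frame_bytes):
    """Return (seq, payload) if the CRC matches, otherwise None."""
    body, crc = frame_bytes[:-4], frame_bytes[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None
    return int.from_bytes(body[:2], "big"), body[2:]


def send_with_replay(payloads, corrupt_first_try=frozenset(), max_replays=3):
    """Deliver payloads in order, replaying frames that fail the CRC check.

    corrupt_first_try lists the sequence numbers whose first transmission
    gets a single bit flipped, to simulate a link error.
    """
    delivered, replays = [], 0
    for seq, payload in enumerate(payloads):
        for attempt in range(max_replays + 1):
            wire = bytearray(frame(seq, payload))
            if attempt == 0 and seq in corrupt_first_try:
                wire[2] ^= 0x01  # flip one bit after the sequence number
            result = check(bytes(wire))
            if result is not None:
                delivered.append(result[1])
                break
            replays += 1  # receiver rejected the frame; sender replays it
        else:
            raise RuntimeError(f"frame {seq} failed after {max_replays} replays")
    return delivered, replays
```

Calling send_with_replay([b"a", b"b", b"c"], corrupt_first_try={1}) delivers
all three payloads in order and reports one replay. The replay count is the
kind of per-link statistic this request asks the OS to expose.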

In the Linux kernel, existing support includes PCIe Advanced Error
Reporting (AER) and basic link error handling. However, detailed
visibility into FEC corrections, CRC errors, and replay events is
limited or vendor-specific. Enhancing OS-level support would improve
observability, proactive fault management, and reliability in high-speed
interconnect environments.
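
For context, the AER statistics that the kernel already exposes can be read
from sysfs. The sketch below assumes the aer_dev_correctable attribute
described in the kernel ABI documentation, which prints one "name count"
pair per line. The exact counter names differ between kernel versions, so
the parser treats the name as opaque text. parse_aer_stats() and
read_aer_correctable() are hypothetical helper names.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci/devices")


def parse_aer_stats(text):
    """Parse 'name count' lines; the counter name may contain spaces."""
    stats = {}
    for line in text.splitlines():
        name, _, count = line.strip().rpartition(" ")
        if name and count.isdigit():
            stats[name] = int(count)
    return stats


def read_aer_correctable(bdf, root=SYSFS_PCI):
    """Return correctable AER counters for a device such as '0000:00:1c.0'.

    Returns an empty dict when the device has no AER statistics file.
    """
    path = Path(root) / bdf / "aer_dev_correctable"
    try:
        return parse_aer_stats(path.read_text())
    except OSError:
        return {}
```

On a system with AER enabled, read_aer_correctable("0000:00:1c.0") would
return the device's correctable error counters, which typically include
Bad TLP and replay-related entries. There is no comparable standard
attribute for FEC correction counts, which is the gap described above.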

Feature Request: 
The following capabilities are requested in the OS:
  Extend PCIe/CXL error reporting to include FEC correction statistics and 
thresholds. 
  Enhance the AER framework to capture CRC error counts and replay events. 
  Provide standardized interfaces (sysfs/debugfs) for link health monitoring. 
  Enable per-link telemetry for error rates, replay counts, and correction 
activity. 
  Integrate link error data with RAS frameworks and system logging. 
  Support proactive fault management (e.g., link retraining, degradation 
alerts). 
  Enable firmware-to-OS handoff of link reliability metrics and thresholds. 
  Ensure compatibility with PCIe Gen5/Gen6 and CXL link features. 
  Provide tools for debugging and validating link reliability issues. 
  Document interpretation of FEC/CRC/replay metrics and recommended actions.
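
To make the threshold and degradation-alert items concrete, the sketch below
shows one possible way to turn two snapshots of link error counters into
per-second rates and flag those above a limit. It is only an illustration:
error_rates(), degraded_counters(), the counter names and the threshold are
hypothetical, and real thresholds would come from the platform or firmware.

```python
def error_rates(prev, curr, interval_s):
    """Per-second rate of each counter between two snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name, now in curr.items():
        before = prev.get(name, 0)
        # A counter lower than before is assumed to have been reset,
        # so the current value is taken as the whole delta.
        delta = now - before if now >= before else now
        rates[name] = delta / interval_s
    return rates


def degraded_counters(prev, curr, interval_s, max_rate_per_s):
    """Return the counters whose error rate exceeds max_rate_per_s."""
    rates = error_rates(prev, curr, interval_s)
    return {name: rate for name, rate in rates.items() if rate > max_rate_per_s}
```

For example, with snapshots taken 60 seconds apart in which a Bad TLP
counter rises from 10 to 130, degraded_counters(prev, curr, 60, 1.0) flags
Bad TLP at 2.0 errors per second. A monitoring agent could use such a result
to raise a degradation alert or request link retraining.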

Business Justification:
  Improves reliability of high-speed interconnects under increasing data 
rates. 
  Enables early detection of signal integrity issues and hardware degradation. 
  Reduces risk of data corruption and system instability. 
  Supports mission-critical and high-performance workloads. 
  Enhances observability and diagnostics for platform validation and support 
teams. 
  Aligns OS capabilities with advanced link-level RAS features in modern 
hardware.

References:
  PCI-SIG PCIe Gen5/Gen6 Specifications (FEC, CRC, Replay Mechanisms) 
  CXL 2.0 / 3.0 Specifications 
  Linux PCIe AER Documentation 
  High-Speed SerDes and Link Reliability Whitepapers

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146665

Title:
  Request for RAS Reliability Support – Off-Package Links FEC + CRC +
  Replay

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146665/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
