Private bug reported:

PCIe Gen6 introduces a major architectural shift to support 64 GT/s data
rates, including PAM4 signaling and FLIT-based encoding. PAM4 signaling in
particular significantly increases the likelihood of bit errors compared to
previous generations, making robust error detection and correction
mechanisms essential for maintaining data integrity and link reliability.

To address this, PCIe Gen6 incorporates a combination of:
  FEC (Forward Error Correction): Corrects bit errors at the physical layer
    in real time.
  CRC (Cyclic Redundancy Check): Detects data corruption at the FLIT/data
    link layer.
  Replay Mechanism: Retransmits corrupted FLITs when the CRC check detects
    errors that FEC could not correct.
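
The detect-and-replay interaction described above can be sketched as a toy
model. This is only an illustration under stated assumptions: CRC-32 from
zlib stands in for the real FLIT CRC, FEC is not modelled (the `channel`
callable represents the link after FEC has corrected what it can), and
MAX_REPLAYS, the function names, and the counter names are hypothetical.

```python
import zlib

MAX_REPLAYS = 3  # hypothetical retry limit, not taken from the PCIe specification


def make_flit(payload: bytes) -> bytes:
    """Append a CRC to the payload (CRC-32 stands in for the real FLIT CRC)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(flit: bytes) -> bool:
    """Return True when the trailing 4-byte CRC matches the payload."""
    payload, crc = flit[:-4], flit[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc


def receive(payload: bytes, channel, counters: dict) -> bytes:
    """Send a FLIT through `channel` and replay it until the CRC passes."""
    for attempt in range(1 + MAX_REPLAYS):
        flit = channel(make_flit(payload))
        if crc_ok(flit):
            return flit[:-4]
        # The CRC caught corruption that survived the (unmodelled) FEC stage.
        counters["crc_errors"] += 1
        if attempt < MAX_REPLAYS:
            counters["replays"] += 1
    raise RuntimeError("link unreliable: replay limit reached")
```

The two counters are the kind of telemetry this request asks the OS to
expose: one counts CRC failures, the other counts the retransmissions they
trigger.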
 
These mechanisms operate together to ensure reliable communication across 
high-speed links while minimizing latency impact. Although largely implemented 
in hardware, they generate telemetry and error events that are critical for 
system-level RAS, diagnostics, and performance tuning.

In the Linux kernel, current PCIe support includes Advanced Error
Reporting (AER), but detailed visibility into Gen6-specific mechanisms
such as FEC correction rates, FLIT-level CRC errors, and replay counters
is limited. Enhancing OS support would provide better observability and
enable proactive management of link health in next-generation platforms.
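
For comparison, kernels with AER statistics support already expose
per-device counters in sysfs files such as aer_dev_correctable. A minimal
sketch of reading them follows, assuming the "Name count" line format
described in the kernel ABI documentation; the helper names are
illustrative, and none of these counters cover FEC, FLIT CRC, or replay
events.

```python
from pathlib import Path


def parse_aer_stats(text: str) -> dict:
    """Parse 'Name count' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly a name followed by a number.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_aer_correctable(bdf: str) -> dict:
    """Read correctable-error counters for one device, e.g. bdf='0000:01:00.0'."""
    path = Path("/sys/bus/pci/devices") / bdf / "aer_dev_correctable"
    return parse_aer_stats(path.read_text())
```

Gen6-specific counters could plausibly be exposed in a similar
one-name-per-line form, which would let existing monitoring scripts be
extended rather than rewritten.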

Feature Request:

Requested OS capabilities:

  Extend the PCIe AER framework to support PCIe Gen6-specific error
    reporting (FEC, FLIT CRC, replay events).
  Provide visibility into FEC correction counts and error rates per link.
  Expose CRC error and replay counters via sysfs/debugfs interfaces.
  Enable per-link health monitoring and threshold-based alerting.
  Integrate Gen6 link error telemetry with RAS frameworks and system logging.
  Support proactive mitigation (e.g., link retraining, speed downgrade, lane
    margining triggers).
  Enable firmware-to-OS handoff of Gen6 link reliability metrics and
    thresholds.
  Ensure compatibility with PCIe Gen6 retimers and switches.
  Provide tools for debugging and validating high-speed link reliability.
  Document interpretation of FEC/CRC/replay metrics and recommended actions.
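
As an illustration of the per-link rate and threshold-based alerting items
above, the following sketch derives error rates from two counter snapshots
and flags any rate above its limit. The counter names (fec_corrected,
flit_crc_errors, replays) and the threshold values are hypothetical; no
such kernel interface exists today, and counter wraparound is not handled.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One snapshot of hypothetical per-link counters."""
    timestamp_s: float
    fec_corrected: int
    flit_crc_errors: int
    replays: int


# Hypothetical alert thresholds, in events per second.
THRESHOLDS = {"fec_corrected": 1000.0, "flit_crc_errors": 1.0, "replays": 1.0}


def error_rates(prev: LinkSample, curr: LinkSample) -> dict:
    """Return events per second for each counter between two snapshots."""
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        name: (getattr(curr, name) - getattr(prev, name)) / dt
        for name in THRESHOLDS
    }


def alerts(prev: LinkSample, curr: LinkSample) -> list:
    """Return the sorted names of counters whose rate exceeds its threshold."""
    rates = error_rates(prev, curr)
    return sorted(name for name, rate in rates.items() if rate > THRESHOLDS[name])
```

A monitoring daemon could call alerts() on successive samples and, when a
counter is flagged, log the event or request one of the mitigations listed
above.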

Business Justification:
  Ensures reliable operation of PCIe Gen6 high-speed links.
  Enables early detection of signal integrity and hardware degradation issues.
  Reduces risk of data corruption and link failures in production systems.
  Improves observability and diagnostics for platform validation and support
    teams.
  Supports high-performance workloads requiring stable, high-bandwidth
    interconnects.
  Aligns OS capabilities with next-generation PCIe Gen6 RAS requirements.

References:
  PCI-SIG PCIe Gen6 Specification (FEC, FLIT, CRC, Replay Mechanisms) 
  Linux PCIe AER Documentation 
  PCIe Gen6 Architecture Whitepapers 
  High-Speed SerDes and Link Reliability Documentation

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146670

Title:
  Request for RAS Reliability Support – PCIe Gen6 Link RAS Support

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146670/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
