Private bug reported:

In modern high-speed interconnects such as PCIe and CXL, timely and
accurate error reporting is essential for effective serviceability. In-
band error reporting refers to mechanisms where error information is
communicated over the same data path as functional traffic, enabling
faster detection and response without relying solely on out-of-band
channels.

MPRAS (Multi-Protocol RAS) in-band error reporting enables unified and
protocol-aware error signaling across PCIe/CXL fabrics. It allows
devices, switches, and endpoints to propagate error information (e.g.,
link errors, protocol violations, poison events) through the fabric to
the host in a standardized manner. This is particularly important in
complex topologies involving switches, multi-level fabrics, and shared
resources.

In the Linux kernel, current error reporting relies heavily on
mechanisms such as PCIe Advanced Error Reporting (AER), ACPI APEI, and
vendor-specific logs. However, support for in-band, fabric-level error
propagation mechanisms like MPRAS is limited or not fully standardized.
Enhancing support would improve observability, reduce detection latency,
and simplify debugging in large-scale deployments.
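For context on the existing mechanisms mentioned above, here is a minimal sketch of reading the per-device AER counters that the kernel already exposes through sysfs. It assumes the `aer_dev_correctable`, `aer_dev_nonfatal`, and `aer_dev_fatal` attributes described in the kernel's AER statistics ABI documentation, with one "name count" pair per line. The helper names and the device address in the usage note are illustrative only, and nothing here is MPRAS-specific.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'NAME COUNT' lines from an aer_dev_* sysfs attribute.

    The counter name may contain spaces (e.g. 'Bad TLP'), so only the
    last whitespace-separated token is treated as the count.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(maxsplit=1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0].strip()] = int(parts[1])
    return counters


def read_aer_counters(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return AER counters for one device, keyed by severity.

    `bdf` is a PCI address such as '0000:00:1c.0' (a placeholder here).
    Severities whose attribute file is absent are skipped, since not
    every device or kernel exposes AER statistics.
    """
    result = {}
    for kind in ("correctable", "nonfatal", "fatal"):
        path = Path(sysfs_root) / bdf / f"aer_dev_{kind}"
        if path.is_file():
            result[kind] = parse_aer_stats(path.read_text())
    return result
```

Calling `read_aer_counters("0000:00:1c.0")` on a system with AER-capable hardware would return a dictionary per severity. On a system without those attributes it returns an empty dictionary.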

Feature Request:
Requested OS capabilities:
  Enable support for in-band error reporting mechanisms (MPRAS) in the
    PCIe and CXL subsystems.
  Integrate MPRAS error events with PCIe AER, CXL RAS, and system
    logging frameworks.
  Support unified error decoding across multiple protocols (PCIe,
    CXL.io, CXL.mem, CXL.cache).
  Provide sysfs/debugfs interfaces for accessing in-band error logs and
    telemetry.
  Enable propagation and aggregation of errors across switches and
    multi-level fabrics.
  Support correlation of in-band errors with hardware components
    (device, link, switch).
  Enable firmware-to-OS handoff of MPRAS capabilities and configuration.
  Provide tools for debugging, validation, and fault injection of
    in-band error scenarios.
  Ensure compatibility with PCIe Gen5/Gen6 and CXL 2.0/3.0 fabrics.
  Document MPRAS workflows, configuration, and error interpretation
    guidelines.
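To illustrate what "aggregation of errors across switches and multi-level fabrics" could mean in practice, the following is a rough sketch only. The topology map, component names, and error kinds are all hypothetical, and it does not reflect any actual MPRAS record format or kernel interface. It simply rolls each reported error up to every ancestor in the fabric, so that a switch or root port shows the combined counts of everything below it.

```python
from collections import Counter


def aggregate_errors(parent_of, errors):
    """Roll per-component error events up to every ancestor.

    parent_of: hypothetical map of component -> upstream component
               (None or missing for the top of the fabric).
    errors:    iterable of (component, error_kind) tuples.
    Returns a dict of component -> Counter of error kinds seen at or
    below that component.
    """
    totals = {}
    for component, kind in errors:
        node = component
        seen = set()  # guards against a malformed, cyclic topology
        while node is not None and node not in seen:
            seen.add(node)
            totals.setdefault(node, Counter())[kind] += 1
            node = parent_of.get(node)
    return totals


# Hypothetical two-level fabric: two endpoints behind one switch.
parent_of = {
    "endpoint0": "switch0",
    "endpoint1": "switch0",
    "switch0": "root_port0",
    "root_port0": None,
}
errors = [
    ("endpoint0", "poison"),
    ("endpoint1", "link_error"),
    ("endpoint0", "poison"),
]
totals = aggregate_errors(parent_of, errors)
```

With this input, `totals["switch0"]` holds two `poison` events and one `link_error`, which is the kind of per-switch view the request asks the OS to make available.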

Business Justification:
  Reduces error detection latency and improves response time. 
  Enhances serviceability in complex PCIe/CXL fabric deployments. 
  Simplifies debugging and root-cause analysis across multi-level topologies. 
  Provides unified error reporting across multiple protocols. 
  Improves operational efficiency for data center and hyperscale environments. 
  Aligns OS capabilities with next-generation fabric-level RAS mechanisms.

References:
  PCI-SIG PCIe Specifications (AER and RAS Enhancements) 
  CXL 2.0 / 3.0 Specifications (RAS and Fabric Error Handling) 
  Linux Kernel PCIe AER and CXL RAS Documentation 
  Industry Whitepapers on In-Band Error Reporting and Fabric Serviceability

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146673

Title:
  Request for RAS Serviceability Support – In-band Error Reporting
  (MPRAS)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146673/+subscriptions


