Private bug reported:
Out-of-Band (OOB) RAS mechanisms provide an independent path for error
reporting and system health monitoring, separate from the main
data/control planes of the system. These mechanisms are critical for
detecting and reporting faults even when the host OS is unresponsive or
compromised.
APML (Advanced Platform Management Link) is AMD's sideband interface
that enables communication between the platform management controller
(e.g., BMC) and the CPU. APML Async Alert
is an asynchronous notification mechanism that allows the CPU to
proactively signal critical events (e.g., thermal excursions, power
anomalies, RAS faults) to the management controller without polling.
This OOB alerting capability enhances serviceability by ensuring that
critical faults are captured and acted upon even if in-band mechanisms
(e.g., PCIe AER, OS logging) are unavailable. It is especially important
for data center environments where remote management and rapid fault
response are essential.
On Linux systems, OOB RAS handling is typically mediated through BMC
interfaces (e.g., IPMI, Redfish) and user-space tools. However,
integration and visibility of APML Async Alert events within the OS and
system management stack are limited. Enhancing support would enable
better coordination between OOB and in-band RAS mechanisms.
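
As a rough illustration of the coordination described above, the
following Python sketch pairs OOB alerts with in-band error records that
occur on the same socket within a short time window. Everything in it is
hypothetical: the `OobAlert` and `InBandEvent` records, the `correlate`
function, the `socket` field, and the 2-second default window are
assumptions made for this example and do not correspond to any existing
kernel or APML interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OobAlert:
    """Hypothetical record for an alert received over the OOB path."""
    timestamp: float  # seconds since an arbitrary epoch
    socket: int       # CPU socket that raised the alert
    kind: str         # e.g. "ras", "thermal", "power"


@dataclass(frozen=True)
class InBandEvent:
    """Hypothetical record for an in-band error (MCA, AER, EDAC)."""
    timestamp: float
    socket: int
    source: str       # "MCA", "AER" or "EDAC"


def correlate(
    alerts: List[OobAlert],
    events: List[InBandEvent],
    window_s: float = 2.0,
) -> List[Tuple[OobAlert, InBandEvent]]:
    """Pair each alert with in-band events on the same socket that
    fall within window_s seconds of the alert (assumed heuristic)."""
    pairs = []
    for alert in alerts:
        for event in events:
            same_socket = alert.socket == event.socket
            close_in_time = abs(alert.timestamp - event.timestamp) <= window_s
            if same_socket and close_in_time:
                pairs.append((alert, event))
    return pairs


alerts = [OobAlert(100.0, 0, "ras"), OobAlert(250.0, 1, "thermal")]
events = [InBandEvent(100.8, 0, "MCA"), InBandEvent(180.0, 1, "EDAC")]
matched = correlate(alerts, events)
```

A real implementation would need a shared time base between the BMC and
the OS, which is one reason synchronized fault reporting matters here.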
Feature Request:
Requested capabilities to be enabled in the OS:
- Enable support for APML Async Alert event reception and handling.
- Integrate APML alerts with system RAS frameworks and logging
  infrastructure.
- Provide interfaces (e.g., sysfs, netlink, or daemon integration) to
  expose OOB alerts to user space.
- Enable correlation of APML alerts with in-band error events (MCA, AER,
  EDAC).
- Support BMC-to-OS communication for synchronized fault reporting.
- Provide drivers or interfaces for APML communication where applicable.
- Enable policy-based actions based on APML alerts (e.g., throttling,
  shutdown, alerting).
- Support integration with management frameworks (e.g., IPMI, Redfish).
- Provide tools for monitoring, testing, and validating APML alert flows.
- Document APML Async Alert behavior, configuration, and usage workflows.
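
The policy-based actions requested above could be modeled as a small
lookup from alert type and severity to an action. The sketch below is
illustrative only: the alert kinds, severity levels, action names, and
the `decide` helper are assumptions made for this example, not values
defined by APML or by any existing Linux interface.

```python
from enum import Enum


class Action(Enum):
    LOG = "log"
    THROTTLE = "throttle"
    SHUTDOWN = "shutdown"


# Hypothetical policy table: (alert kind, severity) -> action.
POLICY = {
    ("thermal", "warning"): Action.THROTTLE,
    ("thermal", "critical"): Action.SHUTDOWN,
    ("power", "warning"): Action.THROTTLE,
    ("ras", "critical"): Action.SHUTDOWN,
}


def decide(kind: str, severity: str) -> Action:
    """Return the configured action, defaulting to LOG for any
    (kind, severity) pair that the table does not list."""
    return POLICY.get((kind, severity), Action.LOG)


thermal_action = decide("thermal", "critical")
unknown_action = decide("ras", "warning")
```

Defaulting to a log-only action keeps unrecognized alerts visible
without triggering a disruptive response.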
Business Justification:
- Ensures critical faults are reported even when the OS is unresponsive.
- Improves serviceability and remote management capabilities.
- Reduces mean time to detect (MTTD) and mean time to respond (MTTR) to
  failures.
- Enhances coordination between platform firmware, BMC, and OS.
- Supports enterprise and hyperscale data center operational requirements.
- Aligns with modern OOB management and RAS strategies.
References:
AMD APML (Advanced Platform Management Link) Documentation
Platform Management (BMC, IPMI, Redfish) Specifications
Linux Kernel Hardware Monitoring and Management Subsystems
Industry Whitepapers on Out-of-Band RAS and System Management
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
https://bugs.launchpad.net/bugs/2146674
Title:
Request for RAS Serviceability Support – Out-of-Band (OOB) RAS with
APML Async Alert