Private bug reported:
Compute Express Link (CXL) enables shared and pooled memory through
CXL.mem, allowing multiple hosts and devices to access external memory
expanders. While this improves scalability and utilization, it
introduces challenges in maintaining availability when faults occur in
shared memory regions or along the CXL fabric.
CXL.mem isolation is a key RAS (Reliability, Availability,
Serviceability) capability that ensures faults (e.g., media errors, link
failures, poison propagation) are contained within affected memory
regions, devices, or paths without impacting the entire system or other
tenants. Isolation mechanisms include address range containment, poison
handling, device-level fencing, and dynamic removal of faulty regions
from the system memory map.
In the Linux kernel, the CXL subsystem (drivers such as cxl_core and
cxl_mem, together with memory hotplug and NUMA integration) enables basic
management of CXL memory devices. However, fine-grained isolation
capabilities for fault containment, especially in multi-tenant and
pooled memory environments, are still evolving. Enhancing OS support is
critical to ensure high availability and resilience in CXL-based
systems.
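To illustrate the granularity involved: when a faulty range is removed through memory hotplug, the unit of removal is a memory block rather than an individual page. The sketch below maps a faulty physical address range onto memory block indices. It assumes the block id equals the physical address divided by the block size, as the memory hotplug documentation describes, and that the block size is read in hexadecimal from /sys/devices/system/memory/block_size_bytes. The function names are hypothetical, and the addresses in the usage note are made-up values.

```python
def read_block_size(path="/sys/devices/system/memory/block_size_bytes"):
    """Read the memory block size; the kernel reports it in hex without a 0x prefix."""
    with open(path) as f:
        return int(f.read().strip(), 16)


def affected_memory_blocks(start, length, block_size):
    """Return the indices of memory blocks overlapping [start, start + length).

    Assumption: block index = physical address // block size.
    """
    if length <= 0 or block_size <= 0:
        raise ValueError("length and block_size must be positive")
    first = start // block_size
    last = (start + length - 1) // block_size
    return list(range(first, last + 1))
```

With a 128 MiB block size (0x8000000), a 4 KiB fault at 0x10000000 falls in one block, while an 8 KiB fault starting at 0x17fff000 straddles two, so both would need to be offlined to contain it this way.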
Feature request:
Capabilities requested in the OS:
Enable fine-grained isolation of faulty CXL.mem regions (range-based
isolation).
Support poison detection, containment, and controlled handling of poison propagation.
Integrate CXL.mem errors with EDAC and RAS frameworks.
Enable dynamic offlining/removal of affected memory regions (memory
hot-remove).
Support device-level isolation (fencing faulty CXL devices or links).
Provide sysfs/debugfs interfaces for monitoring isolation events and memory
health.
Enable coordination with firmware for error containment and recovery workflows.
Support multi-tenant isolation in shared memory pool environments.
Integrate with NUMA and memory tiering for workload-aware isolation and
migration.
Provide tools for fault injection, validation, and debugging of isolation
mechanisms.
Document isolation policies, workflows, and best practices for CXL deployments.
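As a point of comparison for the dynamic offlining item above, the existing memory hotplug interface already allows a single memory block to be offlined from user space by writing "offline" to its sysfs state file. The following is a minimal sketch of that step, assuming the standard /sys/devices/system/memory/memoryN/state layout. The function name and the sysfs_root parameter are hypothetical (the parameter exists only so the code can be exercised against a fake directory tree), and a real request needs root privileges and can fail if the block holds unmovable pages.

```python
import os


def offline_memory_block(index, sysfs_root="/sys/devices/system/memory"):
    """Try to offline memory block `index`; return True if it ends up offline."""
    state_path = os.path.join(sysfs_root, f"memory{index}", "state")

    # Nothing to do if the block is already offline.
    with open(state_path) as f:
        if f.read().strip() == "offline":
            return True

    try:
        # The kernel may reject the request (for example with EBUSY);
        # the error surfaces when the buffered write is flushed on close.
        with open(state_path, "w") as f:
            f.write("offline")
    except OSError:
        return False

    # Re-read the state to confirm the transition actually happened.
    with open(state_path) as f:
        return f.read().strip() == "offline"
```

This covers only block-level offlining. The finer-grained, CXL-aware isolation requested here would need to decide which blocks or regions to act on and coordinate with the CXL drivers and firmware.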
Business Justification:
Improves system availability by isolating faults without full system
downtime.
Enables safe operation of shared and pooled memory environments.
Reduces impact of memory and link failures on running workloads.
Supports multi-tenant cloud and hyperscale deployments.
Enhances resilience and fault tolerance in CXL-based architectures.
Aligns OS capabilities with advanced RAS requirements for disaggregated
memory systems.
References:
CXL 2.0 / 3.0 Specifications (CXL.mem, RAS, Poison Handling)
Linux Kernel CXL Subsystem Documentation
Linux Memory Hotplug and NUMA Documentation
Industry Whitepapers on Memory Disaggregation and High-Availability Systems
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
https://bugs.launchpad.net/bugs/2146672
Title:
Request for RAS Availability Support – CXL.mem Isolation