hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges

Tomasz Wolski Wed, 04 Feb 2026 15:31:41 -0800

>> > I was thinking through what Alison asked about what to do later in boot
>> > when other regions are being dynamically created. It made me wonder if
>> > this safety can be achieved more easily by just making sure that the
>> > alloc_dax_region() call fails.
>> 
>> Agreed with all the points above, including making alloc_dax_region() 
>> fail as the safety mechanism. This also cleanly avoids the no Soft 
>> Reserved case Alison pointed out, where dax_cxl_mode can remain stuck in 
>> DEFER and return -EPROBE_DEFER.
>> 
>> What I’m still trying to understand is the case of “other regions being 
>> dynamically created.” Once HMEM has claimed the relevant HPA range, any 
>> later userspace attempts to create regions (via cxl create-region) 
>> should naturally fail due to the existing HPA allocation. This already 
>> shows up as an HPA allocation failure currently.
>> 
>> #cxl create-region -d decoder0.0 -m mem2 -w 1 -g256
>> cxl region: create_region: region0: set_size failed: Numerical result 
>> out of range
>> cxl region: cmd_create_region: created 0 regions
>> 
>> And in the dmesg:
>> [  466.819353] alloc_hpa: cxl region0: HPA allocation error (-34) for 
>> size:0x0000002000000000 in CXL Window 0 [mem 0x850000000-0x284fffffff 
>> flags 0x200]
>> 
>> Also, at this point, with the probe-ordering fixes and the use of 
>> wait_for_device_probe(), region probing should have fully completed.
>> 
>> Am I missing any other scenario where regions could still be created 
>> dynamically beyond this?
>
>The concern is what to do about regions and memory devices that are
>completely innocent. So, for example imagine deviceA is handled by BIOS
>and deviceB is ignored by BIOS. If deviceB was ignored by BIOS then it
>would be rude to tear down any regions that might be established for
>deviceB. So if alloc_dax_region() exclusion and HPA space reservation
>prevent future collisions while not disturbing innocent devices, then I
>think userspace can pick up the pieces from there.


I'm trying to follow the idea of "deviceB being ignored by BIOS" 
Do you consider hot-plug devices and user creating reqions manually? 
Could you please describe such scenario?

>> > Something like (untested / incomplete, needs cleanup handling!)
>> > 
>> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> > index fde29e0ad68b..fd18343e0538 100644
>> > --- a/drivers/dax/bus.c
>> > +++ b/drivers/dax/bus.c
>> > @@ -10,6 +10,7 @@
>> >   #include "dax-private.h"
>> >   #include "bus.h"
>> >   
>> > +static struct resource dax_regions = DEFINE_RES_MEM_NAMED(0, -1, "DAX 
>> > Regions");
>> >   static DEFINE_MUTEX(dax_bus_lock);
>> >   
>> >   /*
>> > @@ -661,11 +662,7 @@ struct dax_region *alloc_dax_region(struct device 
>> > *parent, int region_id,
>> >          dax_region->dev = parent;
>> >          dax_region->target_node = target_node;
>> >          ida_init(&dax_region->ida);
>> > -       dax_region->res = (struct resource) {
>> > -               .start = range->start,
>> > -               .end = range->end,
>> > -               .flags = IORESOURCE_MEM | flags,
>> > -       };
>> > +       dax_region->res = __request_region(&dax_regions, range->start, 
>> > range->end, flags);
>> >   
>> >          if (sysfs_create_groups(&parent->kobj, 
>> > dax_region_attribute_groups)) {
>> >                  kfree(dax_region);
>> > 
>> > ...which will result in enforcing only one of dax_hmem or dax_cxl being
>> > able to register a dax_region.
>> > 
>> > Yes, this would leave a mess of disabled cxl_dax_region devices lying
>> > around, but it would leave more breadcrumbs for debug, and reduce the
>> > number of races you need to worry about.
>> > 
>> > In other words, I thought total teardown would be simpler, but as the
>> > feedback keeps coming in, I think that brings a different set of
>> > complexity. So just inject failures for dax_cxl to trip over and then we
>> > can go further later to effect total teardown if that proves to not be
>> > enough.
>> 
>> One concern with the approach of not tearing down CXL regions is the 
>> state it leaves behind in /proc/iomem. Soft Reserved ranges are 
>> REGISTERed to HMEM while CXL regions remain present. The resulting 
>> nesting (dax under region, region under window and window under SR) 
>> visually suggests a coherent CXL hierarchy, even though ownership has 
>> effectively moved to HMEM. When users, then attempt to tear regions down 
>> and recreate them from userspace, they hit the same HPA allocation 
>> failures described above.
>
>So this gets back to a question of do we really need "Soft Reserved" to
>show up in /proc/iomem? It is an ABI change to stop publishing it
>altogether, so at a minimum we need to be prepared to keep publishing it
>if it causes someone's working setup to regress.
>
>The current state of the for-7.0/cxl-init branch drops publishing "Soft
>Reserved". I am cautiously optimistic no one notices as long as DAX
>devices keep appearing, but at the first sign of regression we need a
>plan B.
>
>> If we decide not to tear down regions in the REGISTER case, should we 
>> gate decoder resets during user initiated region teardown? Today, 
>> decoders are reset when regions are torn down dynamically, and 
>> subsequent attempts to recreate regions can trigger a large amount of 
>> mailbox traffic. Much of what shows up as repeated “Reading event logs/ 
>> Clearing …” messages which ends up interleaved with the HPA allocation 
>> failure, which can be confusing.
>
>One of the nice side effects of installing the "Soft Reserved" entries
>late, when HMEM takes over, is that they are easier to remove.
>
>So the flow would be, if you know what you are doing, is to disable the
>HMEM device which uninstalls the "Soft Reserved" entries, before trying
>to decommit the region and reclaim the HPA space.
>

Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges

Reply via email to