On 21/08/2025 07:14, Alison Schofield wrote:
> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Dan and Smita,
>>
>>
>> On 24/07/2025 00:13, [email protected] wrote:
>>> dan.j.williams@ wrote:
>>> [..]
>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>> and likely regresses the non-CXL case.
>>>>
>>>> -- 8< --
>>>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>> From: Dan Williams <[email protected]>
>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>
>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>> safely fails in devdax only / fallback mode, but something to
>>> investigate when respinning on top of this.
>>>
>>
>> Thank you for your RFC; I find your proposal remarkably compelling, as it
>> adeptly addresses the issues I am currently facing.
>>
>>
>> To begin with, I still encountered several issues with your patch
>> (though for a patch at the RFC stage, I think it is already quite
>> commendable):
>
> Hi Zhijian,
>
> Like you, I tried this RFC out. It resolved the issue of soft reserved
> resources preventing teardown and replacement of a region in place.
>
> I looked at the issues you found, and have some questions and comments
> included below.
>
>>
>> 1. Some resources described by SRAT are wrongly identified as System RAM
>> (kmem), such as the following: 200000000-5bfffffff.
>>
>> ```
>> 200000000-5bfffffff : dax6.0
>>   200000000-5bfffffff : System RAM (kmem)
>> 5c0001128-5c00011b7 : port1
>> 5d0000000-64fffffff : CXL Window 0
>>   5d0000000-64fffffff : region0
>>     5d0000000-64fffffff : dax0.0
>>       5d0000000-64fffffff : System RAM (kmem)
>> 680000000-e7fffffff : PCI Bus 0000:00
>>
>> [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
>> [ 0.000000] Command line:
>> BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+
>> root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0
>> no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8
>> softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled
>> panic_on_warn ignore_loglevel kasan.fault=panic
>> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff]
>> soft reserved
>> [    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff]
>> soft reserved
>> [    0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff]
>> hotplug
>> ```
>
> Is that range also labelled as soft reserved?
> I ask, because I'm trying to draw a parallel between our test platforms.
No, it's not a soft reserved range. It can simply be simulated with QEMU
using the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bfffffff` is roughly [DRAM_END + 1,
DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE], where DRAM_END is the end
of the installed DRAM in Node 3.
This range is reserved for DRAM hot-add. In my case, it gets registered
as an "HMEM device" by hmat_register_target_devices() calling
hmem_register_resource() in drivers/acpi/numa/hmat.c:
static void hmat_register_target_devices(struct memory_target *target)
{
	struct resource *res;

	/*
	 * Do not bother creating devices if no driver is available to
	 * consume them.
	 */
	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
		return;

	for (res = target->memregions.child; res; res = res->sibling) {
		int target_nid = pxm_to_node(target->memory_pxm);

		hmem_register_resource(target_nid, res);
	}
}
$ dmesg | grep -i -e soft -e hotplug -e Node
[ 0.000000] Command line:
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty
root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0
no_timer_check net.ifnames=0 console=tty1 conc
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft
reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft
reserved
[ 0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[ 0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[ 0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[ 0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[ 0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[ 0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem
0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[ 0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[ 0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[ 0.086077] Movable zone start for each node
[ 0.087054] Early memory node ranges
[ 0.087890] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.089264] node 0: [mem 0x0000000000100000-0x000000007ffdefff]
[ 0.090631] node 1: [mem 0x0000000100000000-0x000000017fffffff]
[ 0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[ 0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[ 0.095164] Initmem setup node 2 as memoryless
[ 0.096281] Initmem setup node 3 as memoryless
[ 0.097397] Initmem setup node 4 as memoryless
[ 0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[ 0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs
=================================
Please note that this is a modified QEMU.
/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
-name guest-rdma-server -nographic -boot c \
-m size=6G,slots=2,maxmem=19922944k \
-hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
-object memory-backend-memfd,share=on,size=2G,id=m0 \
-object memory-backend-memfd,share=on,size=2G,id=m1 \
-numa node,nodeid=0,cpus=0-1,memdev=m0 \
-numa node,nodeid=1,cpus=2-3,memdev=m1 \
-smp 4,sockets=2,cores=2 \
-device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
-device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
-device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
-device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
-object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
-M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
-nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
-bios /home/lizhijian/seabios/out/bios.bin \
-object memory-backend-memfd,share=on,size=1G,id=m2 \
-object memory-backend-memfd,share=on,size=1G,id=m3 \
-numa node,memdev=m2,nodeid=2 \
-numa node,memdev=m3,nodeid=3 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=21 \
-numa dist,src=0,dst=3,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa dist,src=1,dst=2,val=21 \
-numa dist,src=1,dst=3,val=21 \
-numa dist,src=2,dst=0,val=21 \
-numa dist,src=2,dst=1,val=21 \
-numa dist,src=2,dst=2,val=10 \
-numa dist,src=2,dst=3,val=21 \
-numa dist,src=3,dst=0,val=21 \
-numa dist,src=3,dst=1,val=21 \
-numa dist,src=3,dst=2,val=21 \
-numa dist,src=3,dst=3,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M
> I see -
>
> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft
> reserved
> .
> .
> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
>
> /proc/iomem - as expected
> 24080000000-5f77fffffff : CXL Window 0
>   24080000000-4407fffffff : region0
>     24080000000-4407fffffff : dax0.0
>       24080000000-4407fffffff : System RAM (kmem)
>
>
> I'm also seeing this message:
> [] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff]
> conflicts with [mem 0x24080000000-0x4407fffffff]
>
>>
>> 2. Triggers dev_warn and dev_err:
>>
>> ```
>> [root@rdma-server ~]# journalctl -p err -p warning --dmesg
>> ...snip...
>> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache
>> calculation failed rc:-2
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem
>> failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem
>> failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0:
>> 0x100000000-0x17fffffff could not reserve region
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem
>> failed with error -16
>
> I see the kmem dax messages also. It seems the kmem probe is going after
> every range (except hotplug) in the SRAT, and failing.
Yes, that's true, because the current RFC removed the code that filters out
non-soft-reserved resources. As a result, it tries to register dax/kmem for
all of them, while some of them have already been marked busy in iomem_resource.
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
Here is another example from my real *CXL host*:
Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is
disabled. Duplicate IMA measuremen>
Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software
power cap!
Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not
reserve region
Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error
-16
Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could
not reserve region
Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error
-16
Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could
not reserve region
Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error
-16
Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID
lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000
/proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved
This issue can be resolved by re-introducing a
soft_reserve_res_intersects(...) check, I guess.
>
>> ```
>>
>> 3. When CXL_REGION is disabled, there is a failure to fall back to dax_hmem,
>> in which case only CXL Window X is visible.
>
> Haven't tested !CXL_REGION yet.
>
>>
>> On failure:
>>
>> ```
>> 100000000-27fffffff : System RAM
>> 5c0001128-5c00011b7 : port1
>> 5c0011128-5c00111b7 : port2
>> 5d0000000-6cfffffff : CXL Window 0
>> 6d0000000-7cfffffff : CXL Window 1
>> 7000000000-700000ffff : PCI Bus 0000:0c
>>   7000000000-700000ffff : 0000:0c:00.0
>>     7000001080-70000010d7 : mem1
>> ```
>>
>> On success:
>>
>> ```
>> 5d0000000-7cfffffff : dax0.0
>>   5d0000000-7cfffffff : System RAM (kmem)
>> 5d0000000-6cfffffff : CXL Window 0
>> 6d0000000-7cfffffff : CXL Window 1
>> ```
>>
>> In terms of issues 1 and 2, this arises because hmem_register_device()
>> attempts to register resources of all "HMEM devices," whereas we only need
>> to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the
>> current TODO will address this.
>>
>> ```
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
>> + /* TODO: insert "Soft Reserved" into iomem here */
>> ```
>
> Above makes sense.
I think the add_soft_reserved() subroutine in your previous patchset[1] is
able to cover this TODO.
>
> I'll probably wait for an update from Smita to test again, but if you
> or Smita have anything you want me to try out on my hardware in the
> meantime, let me know.
>
Here is my local fixup based on Dan's RFC; it resolves issues 1 and 2.
-- 8< --
commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <[email protected]>
Date:   Fri Aug 20 11:07:15 2025 +0800

    Fix probe-order TODO

    Signed-off-by: Li Zhijian <[email protected]>
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
 	walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
 }
 
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+			     unsigned long flags)
+{
+	struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+	int rc;
+
+	if (!res)
+		return -ENOMEM;
+
+	*res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+				     flags | IORESOURCE_MEM,
+				     IORES_DESC_SOFT_RESERVED);
+
+	rc = insert_resource(&iomem_resource, res);
+	if (rc)
+		kfree(res);
+
+	return rc;
+}
+
 static int hmem_register_device(struct device *host, int target_nid,
 				const struct resource *res)
 {
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
 	long id;
 	int rc;
 
+	if (soft_reserve_res_intersects(res->start, resource_size(res),
+			IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+		return 0;
+
 	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
 	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
 			      IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
 		}
 	}
 
-	/* TODO: insert "Soft Reserved" into iomem here */
+	/*
+	 * This is a verified Soft Reserved region that CXL is not claiming (or
+	 * is being overridden). Add it to the main iomem tree so it can be
+	 * properly reserved by the DAX driver.
+	 */
+	rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+	if (rc) {
+		dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+			 res, rc);
+		return rc;
+	}
 
 	id = memregion_alloc(GFP_KERNEL);
 	if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
 int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
 		      unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+				unsigned long desc);
 
 /* Support for virtually mapped pages */
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
 			       arg, func);
 }
 EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+			       size_t size, unsigned long flags,
+			       unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size,
+				unsigned long flags, unsigned long desc)
+{
+	int ret;
+
+	read_lock(&resource_lock);
+	ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+	read_unlock(&resource_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
 #endif
 
 /*
[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofi...@intel.com/
> -- Alison
>
>
>>
>> Regarding issue 3 (which exists in the current situation), this could be
>> because it cannot ensure that dax_hmem_probe() executes prior to
>> cxl_acpi_probe() when CXL_REGION is disabled.
>>
>> I am pleased that you have pushed the patch to the
>> cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its
>> integration upstream during the v6.18 merge window.
>> Besides the current TODO, you also mentioned that this RFC patch must be
>> further subdivided into several patches, so there remains significant work
>> to be done.
>> If my understanding is correct, you will personally continue to push
>> this patch forward, right?
>>
>>
>> Smita,
>>
>> Do you have any additional thoughts on this proposal from your side?
>>
>>
>> Thanks
>> Zhijian
>>
> snip
>