There is no atomic mechanism to offline and remove an entire
multi-block DAX kmem device.  This is presently done in two steps:
    1. offline all memory blocks
    2. remove all memory blocks

This creates a race window between the two steps in which another
entity can operate directly on the memory blocks, causing hot-unplug
to fail or unbind to deadlock.

Add a new 'state' sysfs attribute that enables an atomic whole-device
hotplug operation across its entire memory region.

daxX.Y/state mirrors the per-block memoryX/state ABI:
  - [offline, online, online_kernel, online_movable]
  - "unplugged" is added specifically for daxX.Y/state

The valid writable states include:
  - "unplugged":      memory blocks are not present
  - "online":         memory is online, zone chosen by the kernel
  - "online_kernel":  memory is online in ZONE_NORMAL
  - "online_movable": memory is online in ZONE_MOVABLE

Valid transitions:
  - unplugged                -> online[_kernel|_movable]
  - online[_kernel|_movable] -> unplugged
  - offline                  -> unplugged

A device can only be onlined from "unplugged", so it must be returned
there before being onlined into a different state.
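
The write rules above can be sketched as a small userspace model.  This
is only an illustration: the enum and kmem_state_write() are
hypothetical names, not part of this patch, and a return of 0 means the
write is accepted by the state machine, not that the underlying hotplug
or hotremove operation succeeded.

```c
#include <errno.h>

/* Hypothetical model of the states; these names are not from the patch. */
enum kmem_state {
	ST_UNPLUGGED,
	ST_OFFLINE,
	ST_ONLINE,
	ST_ONLINE_KERNEL,
	ST_ONLINE_MOVABLE,
};

/*
 * Model of a write of 'req' to daxX.Y/state while the device is in 'cur'.
 * Returns 0 if the transition is permitted, or a negative errno.
 */
static int kmem_state_write(enum kmem_state cur, enum kmem_state req)
{
	/* "offline" is reportable but never writable. */
	if (req == ST_OFFLINE)
		return -EINVAL;

	/* Writing the current state is a no-op. */
	if (cur == req)
		return 0;

	/* Unplug is permitted from offline and from any online state. */
	if (req == ST_UNPLUGGED)
		return 0;

	/* Onlining is only permitted from "unplugged". */
	if (cur != ST_UNPLUGGED)
		return -EBUSY;

	return 0;
}
```

In this model, going from "online" to "online_movable" requires two
writes: "unplugged" first, then "online_movable".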

For backwards compatibility the memory blocks are always created at
probe - existing tools expect them to be present after kmem binds.

"offline" is therefore a reportable state but is not writable: it only
arises from the legacy auto_online_blocks=offline policy.  Onlining
such a device through this attribute requires unplugging it first.
This is intended to push drivers that create DAX devices to set a
default online type.

Unplug is atomic across the whole device: dax_kmem_do_hotremove()
collects every added range and offlines/removes them in one operation.
Either the operation succeeds or is entirely rolled back.
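
The all-or-nothing behaviour can be pictured with the userspace model
below.  It is a sketch of the general pattern only: struct model_range
and model_offline_all() are hypothetical, every range is assumed to
start online, and the code makes no claim about how
offline_and_remove_memory_ranges() is actually implemented.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one memory range; not a kernel structure. */
struct model_range {
	bool online;      /* current state; assumed true on entry */
	bool can_offline; /* false models a range that refuses to offline */
};

/*
 * Offline every range or none: if range i cannot be offlined, ranges
 * [0, i) are brought back online and the call reports failure.
 */
static int model_offline_all(struct model_range *r, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!r[i].can_offline)
			goto rollback;
		r[i].online = false;
	}
	return 0;

rollback:
	/* Undo the ranges that were already offlined, newest first. */
	while (i-- > 0)
		r[i].online = true;
	return -EBUSY;
}
```

A caller sees either every range offline (return 0) or every range
still online (return -EBUSY), never a mixture.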

Unbind Note:
  We used to call remove_memory() during unbind, which would fire a
  BUG() if any of the memory blocks were online at that time.  We lift
  this into a WARN in the cleanup routine and don't attempt hotremove
  if ->state is not DAX_KMEM_UNPLUGGED or MMOP_OFFLINE.

  The memory of an offline dax device is still removed on unbind, as
  before.

  If online at unbind, the resources are leaked (as before), but now
  we prevent deadlock if a memory region is impossible to hotremove.

Suggested-by: Hannes Reinecke <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Gregory Price <[email protected]>
---
 Documentation/ABI/testing/sysfs-bus-dax |  26 +++
 drivers/base/memory.c                   |   9 +
 drivers/dax/kmem.c                      | 224 ++++++++++++++++++++----
 include/linux/memory_hotplug.h          |   1 +
 4 files changed, 224 insertions(+), 36 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax
index b34266bfae49..2dcad1e9dad0 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -151,3 +151,29 @@ Description:
                memmap_on_memory parameter for memory_hotplug. This is
                typically set on the kernel command line -
                memory_hotplug.memmap_on_memory set to 'true' or 'force'."
+
+What:          /sys/bus/dax/devices/daxX.Y/state
+Date:          June, 2026
+KernelVersion: v6.21
+Contact:       [email protected]
+Description:
+               (RW) Controls the state of the memory region.
+               Applies to all memory blocks associated with the device.
+               Only applies to dax_kmem devices.
+
+               Reading returns the current state; the writable states mirror
+               the per-block /sys/devices/system/memory/memoryX/state ABI::
+
+                 "unplugged": memory blocks are not present
+                 "online": memory is online, zone chosen by the kernel
+                 "online_kernel": memory is online in ZONE_NORMAL
+                 "online_movable": memory is online in ZONE_MOVABLE
+
+               "offline" (memory blocks are present but offline) may also be
+               reported - this happens when the device is bound while the
+               auto_online_blocks policy is "offline".  It cannot be written,
+               as it's not useful and creates device destruction races.
+
+               A device can only be onlined from the "unplugged" state, so a
+               device must be returned to "unplugged" before it can be onlined
+               into a different state.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b318344426fa..3a2f69d3af7b 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -46,6 +46,15 @@ int mhp_online_type_from_str(const char *str)
        }
        return -EINVAL;
 }
+EXPORT_SYMBOL_GPL(mhp_online_type_from_str);
+
+const char *mhp_online_type_to_str(int online_type)
+{
+       if (online_type < 0 || online_type >= (int)ARRAY_SIZE(online_type_to_str))
+               return NULL;
+       return online_type_to_str[online_type];
+}
+EXPORT_SYMBOL_GPL(mhp_online_type_to_str);
 
 #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a45e50def537..340486586d82 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -42,9 +42,15 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
        return 0;
 }
 
+#define DAX_KMEM_UNPLUGGED     (-1)
+
 struct dax_kmem_data {
        const char *res_name;
        int mgid;
+       int numa_node;
+       struct dev_dax *dev_dax;
+       int state;
+       struct mutex lock; /* protects hotplug state transitions */
        struct resource *res[];
 };
 
@@ -63,12 +69,22 @@ static void kmem_put_memory_types(void)
        mt_put_memory_types(&kmem_memory_types);
 }
 
+/* True for the online states a kmem dax device can hold. */
+static bool dax_kmem_state_is_online(int state)
+{
+       return state == MMOP_ONLINE ||
+              state == MMOP_ONLINE_KERNEL ||
+              state == MMOP_ONLINE_MOVABLE;
+}
+
 /**
  * dax_kmem_do_hotplug - hotplug memory for dax kmem device
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
+ * @online_type: the online policy to use for the memory blocks
  *
- * Hotplugs all ranges in the dev_dax region as system memory.
+ * Hotplugs all ranges in the dev_dax region as system memory with the
+ * provided online policy (offline, online, online_movable, online_kernel).
  *
  * Returns the number of successfully mapped ranges, or negative error.
  */
@@ -77,9 +93,15 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
                               int online_type)
 {
        struct device *dev = &dev_dax->dev;
-       int i, rc, onlined = 0;
+       int i, rc, added = 0;
        mhp_t mhp_flags;
 
+       if (dax_kmem_state_is_online(data->state))
+               return -EINVAL;
+
+       if (online_type < MMOP_OFFLINE || online_type > MMOP_ONLINE_MOVABLE)
+               return -EINVAL;
+
        for (i = 0; i < dev_dax->nr_range; i++) {
                struct range range;
 
@@ -112,14 +134,14 @@ static int dax_kmem_do_hotplug(struct dev_dax *dev_dax,
                                kfree(data->res[i]);
                                data->res[i] = NULL;
                        }
-                       if (onlined)
+                       if (added)
                                continue;
                        return rc;
                }
-               onlined++;
+               added++;
        }
 
-       return onlined;
+       return added;
 }
 
 /**
@@ -182,45 +204,64 @@ static int dax_kmem_init_resources(struct dev_dax *dev_dax,
  * @dev_dax: the dev_dax instance
  * @data: the dax_kmem_data structure with resource tracking
  *
- * Removes all ranges in the dev_dax region.
+ * Offlines and removes every currently-added range in the dev_dax region
+ * atomically: either all ranges are offlined and removed, or none are and
+ * the device is returned to its prior state.
  *
- * Returns the number of successfully removed ranges.
+ * Returns 0 on success, or a negative errno on failure.
  */
 static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
                                 struct dax_kmem_data *data)
 {
        struct device *dev = &dev_dax->dev;
-       int i, success = 0;
+       struct range *ranges;
+       int i, nr_ranges = 0, rc;
+
+       ranges = kmalloc_array(dev_dax->nr_range, sizeof(*ranges), GFP_KERNEL);
+       if (!ranges)
+               return -ENOMEM;
 
+       /* Collect the ranges that were actually added during probe. */
        for (i = 0; i < dev_dax->nr_range; i++) {
                struct range range;
-               int rc;
 
-               rc = dax_kmem_range(dev_dax, i, &range);
-               if (rc)
+               if (!data->res[i])
                        continue;
-
-               /* range was never added during probe, count as removed */
-               if (!data->res[i]) {
-                       success++;
+               if (dax_kmem_range(dev_dax, i, &range))
                        continue;
-               }
+               ranges[nr_ranges++] = range;
+       }
 
-               rc = remove_memory(range.start, range_len(&range));
-               if (rc == 0) {
-                       /* Release the resource for the successfully removed range */
-                       remove_resource(data->res[i]);
-                       kfree(data->res[i]);
-                       data->res[i] = NULL;
-                       success++;
-                       continue;
-               }
+       /* Nothing added means nothing to remove. */
+       if (!nr_ranges) {
+               kfree(ranges);
+               return 0;
+       }
+
+       rc = offline_and_remove_memory_ranges(ranges, nr_ranges);
+       kfree(ranges);
+       if (rc) {
                any_hotremove_failed = true;
-               dev_err(dev, "mapping%d: %#llx-%#llx hotremove failed\n",
-                       i, range.start, range.end);
+               dev_err(dev, "hotremove failed, device left online: %d\n", rc);
+               return rc;
        }
 
-       return success;
+       /* All ranges removed; release the reserved resources. */
+       for (i = 0; i < dev_dax->nr_range; i++) {
+               if (!data->res[i])
+                       continue;
+               remove_resource(data->res[i]);
+               kfree(data->res[i]);
+               data->res[i] = NULL;
+       }
+
+       return 0;
+}
+#else
+static int dax_kmem_do_hotremove(struct dev_dax *dev_dax,
+                                struct dax_kmem_data *data)
+{
+       return -EBUSY;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
@@ -236,6 +277,18 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
 {
        int i;
 
+       /*
+        * If the device unbind occurs before memory is hotremoved, we can never
+        * remove the memory (requires reboot).  Attempting an offline operation
+        * here may cause deadlock and a failure to finish the unbind.
+        *
+        * Note: This leaks the resources.
+        */
+       if (WARN(((data->state != DAX_KMEM_UNPLUGGED) &&
+                 (data->state != MMOP_OFFLINE)),
+                "Hotplug memory regions stuck online until reboot"))
+               return;
+
        for (i = 0; i < dev_dax->nr_range; i++) {
                if (!data->res[i])
                        continue;
@@ -245,6 +298,85 @@ static void dax_kmem_cleanup_resources(struct dev_dax *dev_dax,
        }
 }
 
+static int dax_kmem_parse_state(const char *buf)
+{
+       int online_type;
+
+       /* "unplugged" is kmem-specific - the rest map to MMOP_ */
+       if (sysfs_streq(buf, "unplugged"))
+               return DAX_KMEM_UNPLUGGED;
+
+       online_type = mhp_online_type_from_str(buf);
+       /* Disallow "offline": it's not useful and creates race conditions */
+       if (online_type == MMOP_OFFLINE)
+               return -EINVAL;
+       return online_type;
+}
+
+static ssize_t state_show(struct device *dev,
+                           struct device_attribute *attr, char *buf)
+{
+       struct dax_kmem_data *data = dev_get_drvdata(dev);
+       const char *state_str;
+
+       if (!data)
+               return -ENXIO;
+
+       if (data->state == DAX_KMEM_UNPLUGGED)
+               state_str = "unplugged";
+       else
+               state_str = mhp_online_type_to_str(data->state);
+
+       return sysfs_emit(buf, "%s\n", state_str ?: "unknown");
+}
+
+static ssize_t state_store(struct device *dev, struct device_attribute *attr,
+                            const char *buf, size_t len)
+{
+       struct dev_dax *dev_dax = to_dev_dax(dev);
+       struct dax_kmem_data *data = dev_get_drvdata(dev);
+       int online_type;
+       int rc;
+
+       if (!data)
+               return -ENXIO;
+
+       online_type = dax_kmem_parse_state(buf);
+       if (online_type < DAX_KMEM_UNPLUGGED)
+               return online_type;
+
+       guard(mutex)(&data->lock);
+
+       /* Already in requested state */
+       if (data->state == online_type)
+               return len;
+
+       if (online_type == DAX_KMEM_UNPLUGGED) {
+               rc = dax_kmem_do_hotremove(dev_dax, data);
+               if (rc)
+                       return rc;
+               data->state = DAX_KMEM_UNPLUGGED;
+               return len;
+       }
+
+       /* Onlining is only allowed from the unplugged state. */
+       if (data->state != DAX_KMEM_UNPLUGGED)
+               return -EBUSY;
+
+       /* Re-acquire resources if previously unplugged, otherwise no-op */
+       rc = dax_kmem_init_resources(dev_dax, data);
+       if (rc < 0)
+               return rc;
+
+       rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
+       if (rc < 0)
+               return rc;
+
+       data->state = online_type;
+       return len;
+}
+static DEVICE_ATTR_RW(state);
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
        struct device *dev = &dev_dax->dev;
@@ -313,6 +445,10 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
        if (rc < 0)
                goto err_reg_mgid;
        data->mgid = rc;
+       data->numa_node = numa_node;
+       data->dev_dax = dev_dax;
+       data->state = DAX_KMEM_UNPLUGGED;
+       mutex_init(&data->lock);
 
        dev_set_drvdata(dev, data);
 
@@ -325,9 +461,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
        if (online_type == DAX_ONLINE_DEFAULT)
                online_type = mhp_get_default_online_type();
 
+       /* Always create blocks for backward compatibility, even if offline */
        rc = dax_kmem_do_hotplug(dev_dax, data, online_type);
        if (rc < 0)
                goto err_hotplug;
+       data->state = online_type;
+
+       rc = device_create_file(dev, &dev_attr_state);
+       if (rc)
+               dev_warn(dev, "failed to create state sysfs entry\n");
 
        return 0;
 
@@ -348,20 +490,26 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-       int success;
        int node = dev_dax->target_node;
        struct device *dev = &dev_dax->dev;
        struct dax_kmem_data *data = dev_get_drvdata(dev);
 
+       device_remove_file(dev, &dev_attr_state);
        /*
-        * We have one shot for removing memory, if some memory blocks were not
-        * offline prior to calling this function remove_memory() will fail, and
-        * there is no way to hotremove this memory until reboot because device
-        * unbind will succeed even if we return failure.
+        * Online memory cannot safely be removed (offlining during unbind can
+        * deadlock a task as unbind cannot be interrupted).  Unfortunately we
+        * have to leak all of [resources, memory group, @data, memtype], until
+        * the next reboot - and the memory will stay online until then.
+        *
+        * offline blocks are removed on unbind, but may leak on failure.
         */
-       success = dax_kmem_do_hotremove(dev_dax, data);
-       if (success < dev_dax->nr_range) {
-               dev_err(dev, "Hotplug regions stuck online until reboot\n");
+       if (dax_kmem_state_is_online(data->state)) {
+               dev_warn(dev, "Hotplug regions stuck online until reboot\n");
+               any_hotremove_failed = true;
+               return;
+       } else if (data->state == MMOP_OFFLINE &&
+           dax_kmem_do_hotremove(dev_dax, data)) {
+               dev_warn(dev, "Unplug failed, resources leaked until reboot\n");
                return;
        }
 
@@ -382,6 +530,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 #else
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
+       struct device *dev = &dev_dax->dev;
+
+       device_remove_file(dev, &dev_attr_state);
+
        /*
         * Without hotremove purposely leak the request_mem_region() for the
         * device-dax range and return '0' to ->remove() attempts. The removal
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7f1da7c428dc..46c796570692 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -127,6 +127,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size,
 extern u64 max_mem_size;
 
 extern int mhp_online_type_from_str(const char *str);
+const char *mhp_online_type_to_str(int online_type);
 
 /* If movable_node boot option specified */
 extern bool movable_node_enabled;
-- 
2.54.0

