This is an automated email from the ASF dual-hosted git repository.
weichiu pushed a commit to branch HDDS-5713
in repository https://gitbox.apache.org/repos/asf/ozone.git
The following commit(s) were added to refs/heads/HDDS-5713 by this push:
new 2334f41049 HDDS-12598. [DiskBalancer] Add design and feature document
(#8837)
2334f41049 is described below
commit 2334f4104972084cf8beed22f47d744e468abf48
Author: Gargi Jaiswal <[email protected]>
AuthorDate: Wed Jul 23 12:59:57 2025 +0530
HDDS-12598. [DiskBalancer] Add design and feature document (#8837)
---
hadoop-hdds/docs/content/design/diskbalancer.md | 136 +++++++++++++++++++++
hadoop-hdds/docs/content/feature/DiskBalancer.md | 122 ++++++++++++++++++
.../docs/content/feature/DiskBalancer.zh.md | 116 ++++++++++++++++++
hadoop-hdds/docs/content/feature/diskBalancer.png | Bin 0 -> 116124 bytes
4 files changed, 374 insertions(+)
diff --git a/hadoop-hdds/docs/content/design/diskbalancer.md
b/hadoop-hdds/docs/content/design/diskbalancer.md
new file mode 100644
index 0000000000..5121631c03
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/diskbalancer.md
@@ -0,0 +1,136 @@
+---
+title: "DiskBalancer for Datanode"
+summary: "DiskBalancer is a feature to evenly distribute data across all disks within a Datanode for even disk utilisation."
+date: 2025-07-21
+jira: HDDS-5713
+status: implementing
+author: Janus Chow, Sammi Chen, Gargi Jaiswal, Stephen O' Donnell
+---
+<!--
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License. See accompanying LICENSE file.
+-->
+# [HDDS-5713](https://issues.apache.org/jira/browse/HDDS-5713) DiskBalancer for Datanode (implementing)
+
+## Background
+**Apache Ozone** distributes containers evenly across
+the multiple disks of each Datanode. This initial spread
+ensures that I/O load is balanced from the start. However,
+over the operational lifetime of a cluster, **disk imbalance** can
+occur for the following reasons:
+- **Adding new disks** to expand datanode storage space.
+- **Replacing old broken disks** with new disks.
+- Massive **block** or **replica deletions**.
+
+This uneven utilisation of disks can create performance bottlenecks, as
+**over-utilised disks** become **hotspots**, limiting the overall throughput of the
+Datanode. As a result, a new feature, **DiskBalancer**, is introduced to
+ensure even data distribution across disks within a Datanode.
+
+## Proposed Solution
+The DiskBalancer is a feature that evenly distributes data across
+the different disks of a Datanode.
+
+It detects an imbalance within a Datanode using a metric borrowed from
+[HDFS DiskBalancer](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html)
+called **Volume Data Density**. This metric is calculated for
+each disk using the following formula:
+
+```
+AverageUtilization = TotalUsed / TotalCapacity
+VolumeUtilization = diskUsedSpace / diskCapacity
+VolumeDensity = | VolumeUtilization - AverageUtilization |
+```
+Here, **VolumeUtilization** is each disk's individual utilization and
+**AverageUtilization** is the ideal utilization all disks should maintain for
+evenness.
+
+A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable
+`threshold`. The DiskBalancer then moves containers from the most
+utilised disk to the least utilised disk. DiskBalancer can be triggered manually by **CLI commands**.
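The formula above can be sketched in a few lines. This is a minimal illustration, not Ozone's implementation; the disk names, byte counts, and threshold value are hypothetical.

```python
# Sketch of the VolumeDataDensity calculation described above.
# Disk stats are given as (used_bytes, capacity_bytes) pairs.

def volume_densities(disks):
    """disks: dict of disk name -> (used_bytes, capacity_bytes)."""
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(cap for _, cap in disks.values())
    avg_utilization = total_used / total_capacity        # AverageUtilization
    return {
        name: abs(used / cap - avg_utilization)          # VolumeDensity per disk
        for name, (used, cap) in disks.items()
    }

disks = {"d1": (90, 100), "d2": (10, 100)}   # hypothetical stats
densities = volume_densities(disks)
# AverageUtilization = 100/200 = 0.5, so both disks deviate by 0.4
threshold = 0.1
candidates = [n for n, d in densities.items() if d > threshold]
```

With these numbers both disks exceed the threshold, so both are candidates for balancing.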
+
+## High-Level DiskBalancer Implementation
+
+The general view of this design consists of 3 parts as follows:
+
+**Client & SCM:**
+
+Administrators use the `ozone admin datanode diskbalancer` CLI to manage and monitor the feature.
+* Clients can control the DiskBalancer job by sending requests to SCM, such as start,
+stop, and update configuration, and can query the DiskBalancer status.
+* Clients get the storageReport from SCM to decide which Datanodes to balance.
+
+**SCM & DN:**
+
+SCM acts as a **control plane** and information hub but remains **stateless**
+regarding the balancing process.
+* SCM retrieves the storageReport and balancing status via heartbeats from the DN.
+
+**DN:**
+
+All balancing operations are performed on Datanodes.
+
+A daemon thread, the **Scheduler**, runs periodically on each Datanode.
+1. It calculates the `VolumeDataDensity` for all volumes.
+2. If an imbalance is detected (i.e., density > threshold), it moves a set of closed containers
+from the most over-utilized disk (source) to the least utilized disk (destination).
+3. The scheduler dispatches these move tasks to a pool of **Worker** threads for parallel execution.
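The three steps above can be sketched as a single scheduler pass. This is an illustrative assumption of the control flow, not Ozone's actual classes; the disk-stats shape and the `move_container` callback are made up for the example.

```python
# Illustrative sketch of one Scheduler iteration described above.
from concurrent.futures import ThreadPoolExecutor

def balance_once(disks, threshold, pool, move_container):
    """Detect imbalance and dispatch a move task to a worker thread."""
    total_used = sum(u for u, _ in disks.values())
    total_cap = sum(c for _, c in disks.values())
    avg = total_used / total_cap
    util = {name: u / c for name, (u, c) in disks.items()}
    source = max(util, key=util.get)   # most over-utilized volume
    dest = min(util, key=util.get)     # least utilized volume
    if abs(util[source] - avg) <= threshold:
        return None                    # density within threshold: nothing to do
    return pool.submit(move_container, source, dest)

with ThreadPoolExecutor(max_workers=5) as pool:
    future = balance_once({"d1": (90, 100), "d2": (10, 100)}, 0.1,
                          pool, lambda s, d: (s, d))
```

In the real service this loop repeats at the configured service interval and dispatches multiple moves per pass.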
+
+## Container Move Process
+
+Suppose we are moving container C1 **(CLOSED state)** from source disk D1 to destination disk D2:
+1. A temporary copy, `Temp C1-CLOSED`, is created in the `temp directory` of the destination disk D2.
+2. `Temp C1-CLOSED` is transitioned to the `Temp C1-RECOVERING` state. This **Temp C1-RECOVERING** container is then
+atomically moved to the **final destination** directory of D2 as `C1-RECOVERING`.
+3. A **new container** import is initiated for the `C1-RECOVERING` container.
+4. Once the import succeeds, all metadata updates are done for the new container created on D2.
+5. Finally, the original container `C1-CLOSED` on D1 is deleted.
+
+```
+D1 ----> C1-CLOSED --- (5) ---> C1-DELETED
+           |
+           |
+          (1)
+           |
+D2 ----> Temp C1-CLOSED --- (2) ---> Temp C1-RECOVERING --- (3) ---> C1-RECOVERING --- (4) ---> C1-CLOSED
+```
+## DiskBalancing Policies
+
+By default, the DiskBalancer uses specific policies to decide which disks to balance and which containers to move. These
+are configurable, but the default implementations provide robust and safe behavior.
+
+* **`DefaultVolumeChoosingPolicy`**: This is the default policy for selecting the source and destination volumes. It
+identifies the most over-utilized volume as the source and the most under-utilized volume as the destination by comparing
+each volume's utilization against the Datanode's average. The calculation accounts for data that is
+already in the process of being moved, ensuring accurate decisions based on the future state of the volumes.
+
+* **`DefaultContainerChoosingPolicy`**: This is the default policy for selecting which container to move from a source
+volume. It iterates through the containers on the source disk and picks the first one that is in a **CLOSED** state
+and is not already being moved by another balancing operation. To optimize performance and avoid re-scanning the same
+containers repeatedly, it caches the list of containers for each volume; the cache expires one hour after its last
+use, or when the container iterator for that volume is exhausted.
+
+## DiskBalancer Metrics
+
+The DiskBalancer service exposes JMX metrics on each Datanode for real-time monitoring. These metrics provide insights
+into the balancer's activity, progress, and overall health.
+
+| DiskBalancer Service Metrics | Description |
+|------------------------------|-------------|
+| `SuccessCount` | The number of successful balance jobs. |
+| `SuccessBytes` | The total bytes of successfully balanced jobs. |
+| `FailureCount` | The number of failed balance jobs. |
+| `moveSuccessTime` | The time spent on successful container moves. |
+| `moveFailureTime` | The time spent on failed container moves. |
+| `runningLoopCount` | The total number of times the balancer's main loop has run. |
+| `idleLoopNoAvailableVolumePairCount` | The number of loops where balancing did not run because no suitable source/destination volume pair could be found. |
+| `idleLoopExceedsBandwidthCount` | The number of loops where balancing did not run due to bandwidth limits. |
+
diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.md
b/hadoop-hdds/docs/content/feature/DiskBalancer.md
new file mode 100644
index 0000000000..6ef86ec965
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.md
@@ -0,0 +1,122 @@
+---
+title: "DiskBalancer"
+weight: 1
+menu:
+ main:
+ parent: Features
+summary: DiskBalancer For DataNodes.
+---
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+## Overview
+**Apache Ozone** distributes containers evenly across the multiple disks of each Datanode.
+This initial spread ensures that I/O load is balanced from the start. However, over the operational lifetime of a
+cluster, **disk imbalance** can occur due to the following reasons:
+- **Adding new disks** to expand datanode storage space.
+- **Replacing old broken disks** with new disks.
+- Massive **block** or **replica deletions**.
+
+This uneven utilisation of disks can create performance bottlenecks, as **over-utilised disks** become **hotspots**,
+limiting the overall throughput of the Datanode. As a result, a new feature, **DiskBalancer**, is introduced to
+ensure even data distribution across disks within a Datanode.
+
+A disk is considered a candidate for balancing if its
+`VolumeDataDensity` exceeds a configurable `threshold`. DiskBalancer can be triggered manually by **CLI commands**.
+
+
+
+
+## Command Line Usage
+The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.
+
+### **Start DiskBalancer**
+To start DiskBalancer on all Datanodes with default configurations:
+
+```shell
+ozone admin datanode diskbalancer start -a
+```
+
+You can also start DiskBalancer with specific options:
+
+```shell
+ozone admin datanode diskbalancer start [options]
+```
+
+### **Update Configurations**
+To update DiskBalancer configurations, use the following command:
+
+```shell
+ozone admin datanode diskbalancer update [options]
+```
+**Options include:**
+
+| Options | Description |
+|---------|-------------|
+| `-t, --threshold` | Percentage deviation from average utilization of the disks after which a datanode will be rebalanced. |
+| `-b, --bandwithInMB` | Maximum bandwidth for DiskBalancer per second, in MB. |
+| `-p, --parallelThread` | Maximum number of parallel threads for DiskBalancer. |
+| `-s, --stop-after-disk-even` | Stop DiskBalancer automatically after disk utilization is even. |
+| `-a, --all` | Run commands on all datanodes. |
+| `-d, --datanodes` | Run commands on specific datanodes. |
+
+### **Stop DiskBalancer**
+To stop DiskBalancer on all Datanodes:
+
+```shell
+ozone admin datanode diskbalancer stop -a
+```
+You can also stop DiskBalancer on specific Datanodes:
+
+```shell
+ozone admin datanode diskbalancer stop -d <datanode1>
+```
+### **DiskBalancer Status**
+To check the status of DiskBalancer on all Datanodes:
+
+```shell
+ozone admin datanode diskbalancer status
+```
+You can also check status of DiskBalancer on specific Datanodes:
+
+```shell
+ozone admin datanode diskbalancer status -d <datanode1>
+```
+### **DiskBalancer Report**
+To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order);
+N defaults to 25 if not specified:
+
+```shell
+ozone admin datanode diskbalancer report --count <N>
+```
+
+## **DiskBalancer Configurations**
+
+The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.
+
+| Property | Default Value | Description |
+|----------|---------------|-------------|
+| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A datanode is considered balanced if, for each volume, its utilization differs from the average datanode utilization by no more than this threshold. |
+| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
+| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
+| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
+| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
+| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
+| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
+| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
+| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |
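To override a default, the property goes in `ozone-site.xml` like any other Ozone setting. A minimal sketch; the values shown are illustrative, not recommendations:

```xml
<!-- Hedged example: overriding two DiskBalancer defaults in ozone-site.xml.
     A tighter threshold and a higher move bandwidth, purely for illustration. -->
<property>
  <name>hdds.datanode.disk.balancer.volume.density.threshold</name>
  <value>5.0</value>
</property>
<property>
  <name>hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec</name>
  <value>20</value>
</property>
```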
+
diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
new file mode 100644
index 0000000000..a44fb07cd7
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
@@ -0,0 +1,116 @@
+---
+title: "DiskBalancer"
+weight: 1
+menu:
+ main:
 parent: Features
+summary: DiskBalancer for Datanodes.
+---
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+## Overview
+**Apache Ozone** distributes containers evenly across the multiple disks of each Datanode.
+This initial spread ensures that I/O load is balanced from the start. However,
+over the operational lifetime of a cluster, disk imbalance can occur for the following reasons:
+- **Adding new disks** to expand datanode storage space.
+- **Replacing old broken disks** with new disks.
+- Massive **block** or **replica deletions**.
+
+This uneven utilisation of disks can create performance bottlenecks, as **over-utilised disks** become **hotspots**,
+limiting the overall throughput of the Datanode. As a result, a new feature, **DiskBalancer**,
+is introduced to ensure even data distribution across disks within a Datanode.
+
+A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable `threshold`.
+DiskBalancer can be triggered manually by **CLI commands**.
+
+
+
+## Command Line Usage
+The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.
+
+### **Start DiskBalancer**
+To start DiskBalancer on all Datanodes with default configurations:
+
+```shell
+ozone admin datanode diskbalancer start -a
+```
+
+You can also start DiskBalancer with specific options:
+```shell
+ozone admin datanode diskbalancer start [options]
+```
+
+### **Update Configurations**
+To update DiskBalancer configurations, use the following command:
+
+```shell
+ozone admin datanode diskbalancer update [options]
+```
+**Options include:**
+
+| Options | Description |
+|---------|-------------|
+| `-t, --threshold` | Percentage deviation from average utilization of the disks after which a datanode will be rebalanced. |
+| `-b, --bandwithInMB` | Maximum bandwidth for DiskBalancer per second, in MB. |
+| `-p, --parallelThread` | Maximum number of parallel threads for DiskBalancer. |
+| `-s, --stop-after-disk-even` | Stop DiskBalancer automatically after disk utilization is even. |
+| `-a, --all` | Run commands on all datanodes. |
+| `-d, --datanodes` | Run commands on specific datanodes. |
+
+### **Stop DiskBalancer**
+To stop DiskBalancer on all Datanodes:
+
+```shell
+ozone admin datanode diskbalancer stop -a
+```
+You can also stop DiskBalancer on specific Datanodes:
+
+```shell
+ozone admin datanode diskbalancer stop -d <datanode1>
+```
+### **DiskBalancer Status**
+To check the status of DiskBalancer on all Datanodes:
+
+```shell
+ozone admin datanode diskbalancer status
+```
+You can also check the status of DiskBalancer on specific Datanodes:
+```shell
+ozone admin datanode diskbalancer status -d <datanode1>
+```
+### **DiskBalancer Report**
+To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order);
+N defaults to 25 if not specified:
+
+```shell
+ozone admin datanode diskbalancer report --count <N>
+```
+
+## DiskBalancer Configurations
+
+The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.
+
+| Property | Default Value | Description |
+|----------|---------------|-------------|
+| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A datanode is considered balanced if, for each volume, its utilization differs from the average datanode utilization by no more than this threshold. |
+| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
+| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
+| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
+| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
+| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
+| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
+| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
+| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |
+
diff --git a/hadoop-hdds/docs/content/feature/diskBalancer.png
b/hadoop-hdds/docs/content/feature/diskBalancer.png
new file mode 100644
index 0000000000..5c146bf062
Binary files /dev/null and b/hadoop-hdds/docs/content/feature/diskBalancer.png
differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]