This is an automated email from the ASF dual-hosted git repository.

sumitagrawal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/ozone.git


The following commit(s) were added to refs/heads/master by this push:
     new f7270513a5 HDDS-12929. Datanode Should Immediately Trigger Container Close when Volume Full (#8460)
f7270513a5 is described below

commit f7270513a5f848f7e43c6d33acfeb2d238393311
Author: Siddhant Sangwan <[email protected]>
AuthorDate: Mon Jul 7 11:24:38 2025 +0530

    HDDS-12929. Datanode Should Immediately Trigger Container Close when Volume Full (#8460)
---
 .../docs/content/design/full-volume-handling.md    | 163 +++++++++++++++++++++
 1 file changed, 163 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/full-volume-handling.md b/hadoop-hdds/docs/content/design/full-volume-handling.md
new file mode 100644
index 0000000000..fcc555882d
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/full-volume-handling.md
@@ -0,0 +1,163 @@
+---
+title: Full Volume Handling
+summary: Immediately trigger Datanode heartbeat on detecting full volume
+date: 2025-05-12
+jira: HDDS-12929
+status: implemented 
+author: Siddhant Sangwan, Sumit Agrawal
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+> **Note**: The feature described here was implemented in [pull request #8590](https://github.com/apache/ozone/pull/8590). This document reflects the final, merged design.
+
+## Summary
+While handling a write request, trigger a Datanode heartbeat immediately when the container being written to is close to full, the volume is full, or the container is unhealthy. The immediate heartbeat will carry a close container action. The overall objective is to keep Datanode disks from filling up completely and to prevent degradation of write performance.
+
+## Problem
+When a Datanode volume is close to full, the SCM may not be immediately aware of this because storage reports are only sent to it every minute (`HDDS_NODE_REPORT_INTERVAL_DEFAULT = "60s"`). Additionally, the SCM only has stale information about the current size of a container, because container size is only updated when an Incremental Container Report (event based, for example when a container transitions from open to closing state) or a Full Container Report (`HDDS_CONTAINER_REPORT_INTERVAL_DEFAULT = "60m"`) is received. This can lead to the SCM over-allocating blocks to containers on a Datanode volume that has already reached the min free space boundary.
+
+In the future, in https://issues.apache.org/jira/browse/HDDS-12151 we plan to fail writes for containers on a volume that has reached the min free space boundary. Once the Datanode fails writes for such a volume, overall write performance will drop because the client will have to request a different set of blocks.
+
+Before this change, a close container action was queued to be sent in the next heartbeat when:
+1. The container was at 90% capacity.
+2. The volume was full, __counting committed space__. That is, `available - reserved - committed - min free space <= 0`.
+3. The container was `UNHEALTHY`.
+
+But since the next heartbeat could be sent up to 30 seconds later in the worst case, this reaction time was too slow. This design proposes sending the Datanode heartbeat immediately so the SCM receives the close container action right away. This will help reduce the performance drop from write failures once https://issues.apache.org/jira/browse/HDDS-12151 is implemented.
+
+### The definition of a full volume
+Previously, a volume was considered full if the following method returned a value less than or equal to zero. It accounts for available space, committed space, min free space and reserved space (`available` already has `reserved` space subtracted):
+```java
+  private static long getUsableSpace(
+      long available, long committed, long minFreeSpace) {
+    return available - committed - minFreeSpace;
+  }
+```
+Counting committed space here, _when sending a close container action_, is a bug: we only want to close the container if `available - reserved - minFreeSpace <= 0`.
+
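The corrected check can be sketched as below. This is an illustrative, self-contained example, not the actual Ozone code: the class and method names (`FullVolumeCheck`, `isFullForContainerClose`) are hypothetical, and `available` is assumed to already have reserved space subtracted, as noted above.

```java
// Hypothetical sketch, not the real Ozone API. `available` is assumed to
// already exclude reserved space.
final class FullVolumeCheck {

  // Old (buggy) check for the close decision: also subtracts committed
  // space, so a volume with plenty of usable space but a large committed
  // reservation is wrongly treated as full.
  static boolean isFullCountingCommitted(long available, long committed,
      long minFreeSpace) {
    return available - committed - minFreeSpace <= 0;
  }

  // Corrected check for the close-container decision: ignore committed space.
  static boolean isFullForContainerClose(long available, long minFreeSpace) {
    return available - minFreeSpace <= 0;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // 10 GB available, 8 GB committed, 5 GB min free space:
    System.out.println(isFullCountingCommitted(10 * gb, 8 * gb, 5 * gb)); // true
    System.out.println(isFullForContainerClose(10 * gb, 5 * gb)); // false
  }
}
```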
+## Non Goals
+Failing the write if it exceeds the min free space boundary (https://issues.apache.org/jira/browse/HDDS-12151) is not discussed here.
+
+## Proposed Solution
+
+### Proposal for immediately triggering Datanode heartbeat
+
+We will immediately trigger the heartbeat when:
+1. The container is close to full (this is existing behaviour; the container full check already exists).
+2. The volume is __full EXCLUDING committed space__ (`available - reserved - min free <= 0`). This is because when a volume is full INCLUDING committed space (`available - reserved - committed - min free <= 0`), open containers can still accept writes. So the current behaviour of sending a close container action when the volume is full including committed space is a bug.
+3. The container is unhealthy (this is existing behaviour).
+
+Logic to trigger a heartbeat immediately already exists - we just need to call the method when needed. So, in `HddsDispatcher`, when handling a request:
+1. For every write request:
+   1. Check the above three conditions.
+      1. If true, queue the `ContainerAction` to the context as before, then immediately trigger the heartbeat using:
+```java
+context.getParent().triggerHeartbeat();
+```
+
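The dispatch-time decision can be modelled as a single predicate. The following is a simplified, self-contained sketch; the names (`CloseActionDemo`, `shouldQueueCloseAction`) and the 90% threshold constant are stand-ins for the real `HddsDispatcher` logic, which additionally builds the `ContainerAction` and calls `context.getParent().triggerHeartbeat()`.

```java
// Simplified, hypothetical model of the three trigger conditions; the real
// logic lives in HddsDispatcher and the container/volume classes.
final class CloseActionDemo {

  enum ContainerState { OPEN, UNHEALTHY }

  // Condition 1: container near full (existing ~90% check).
  // Condition 2: volume full EXCLUDING committed space.
  // Condition 3: container unhealthy.
  static boolean shouldQueueCloseAction(double containerUsedFraction,
      long volumeAvailable, long volumeMinFree, ContainerState state) {
    boolean containerNearFull = containerUsedFraction >= 0.9;
    boolean volumeFull = volumeAvailable - volumeMinFree <= 0;
    boolean unhealthy = state == ContainerState.UNHEALTHY;
    return containerNearFull || volumeFull || unhealthy;
  }

  public static void main(String[] args) {
    long gb = 1024L * 1024 * 1024;
    // Half-full container on a roomy volume: no close action, no extra heartbeat.
    System.out.println(shouldQueueCloseAction(0.5, 100 * gb, 5 * gb,
        ContainerState.OPEN)); // false
    // Volume at the min free space boundary: queue the action, heartbeat now.
    System.out.println(shouldQueueCloseAction(0.5, 5 * gb, 5 * gb,
        ContainerState.OPEN)); // true
  }
}
```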
+#### Throttling
+Throttling is required so the Datanode doesn't send multiple immediate heartbeats for the same container. We can have per-container throttling, where we trigger an immediate heartbeat once for a particular container; after that, container actions for that container can only be sent in the regular, scheduled heartbeats. Meanwhile, immediate heartbeats can still be triggered for different containers.
+
+Here's a visualisation to show this. The letters (A, B, C etc.) denote events and the timestamp is the time at which an event occurs.
+```
+Write Call 1:
+/ A, timestamp: 0/-------------/B, timestamp: 5/
+
+Write Call 2, in-parallel with 1:
+------------------------------ /C, timestamp: 5/
+
+Write Call 3, in-parallel with 1 and 2:
+---------------------------------------/D, timestamp: 7/
+
+Write Call 4:
+------------------------------------------------------------------------/E, timestamp: 30/
+
+Events:
+A: Last, regular heartbeat
+B: Volume 1 detected as full while writing to Container 1, heartbeat triggered
+C: Volume 1 again detected as full while writing to Container 2, heartbeat triggered
+D: Container 1 detected as full, heartbeat throttled
+E: Volume 1 detected as full while writing to Container 2, Container Action sent in regular heartbeat
+```
+
+A simple and thread safe way to implement this is to have an `AtomicBoolean` for each container in the `ContainerData` class, used to check whether an immediate heartbeat has already been triggered for that container. The memory impact of introducing a new member variable in `ContainerData` for each container is quite small. Consider:
+
+- A Datanode with 600 TB disk space.
+- Container size is 5 GB.
+- Number of containers = 600 TB / 5 GB = 120,000.
+- For each container, we'll have an `AtomicBoolean`, which has just one instance field: `private volatile int value`.
+  - Static fields don't count towards per-instance size.
+- An int is 4 bytes in Java. Assume other per-object overhead brings the size of the object to ~20 bytes.
+- 20 bytes * 120,000 = ~2.4 MB.
+
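The throttling flag can be sketched as below. This is a minimal illustration, assuming a hypothetical `tryTriggerImmediateHeartbeat` helper; the real field and method names in `ContainerData` may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: field and method names are illustrative, not the actual
// ContainerData members.
final class ContainerDataSketch {
  // One flag per container; at ~20 bytes per AtomicBoolean object,
  // 120,000 containers cost roughly 2.4 MB in total.
  private final AtomicBoolean immediateHeartbeatTriggered = new AtomicBoolean(false);

  // compareAndSet is atomic, so even under concurrent write calls exactly
  // one thread wins the right to trigger the immediate heartbeat.
  boolean tryTriggerImmediateHeartbeat() {
    return immediateHeartbeatTriggered.compareAndSet(false, true);
  }

  public static void main(String[] args) {
    ContainerDataSketch container1 = new ContainerDataSketch();
    System.out.println(container1.tryTriggerImmediateHeartbeat()); // true: first trigger
    System.out.println(container1.tryTriggerImmediateHeartbeat()); // false: throttled
    // A different container is throttled independently.
    System.out.println(new ContainerDataSketch().tryTriggerImmediateHeartbeat()); // true
  }
}
```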
+For code implementation, see https://github.com/apache/ozone/pull/8590.
+
+## Alternatives
+### Preventing over-allocation of blocks in the SCM
+The other part of the problem is that the SCM has stale information about the size of a container and ends up over-allocating blocks (beyond the container's 5 GB size) to the same container. Solving this problem is complicated. We could track how much space we've allocated to a container in the SCM - this is doable on the surface but won't actually work well. That's because the SCM is asked for a block (256 MB), but the SCM doesn't know how much data a client will actually write to that block file. The client may only write 1 MB, for example. So the SCM could track that it has already allocated 5 GB to a container and open another container for incoming requests, but the client may actually only write 1 GB. This would lead to a lot of open containers when we have 10k requests/second.
+
+At this point, we've decided not to do anything about this.
+
+### Regularly sending open container reports
+Sending open container reports regularly (every 30 seconds, for example) can help a little, but won't solve the problem. We won't take this approach for now.
+
+### Sending storage reports in immediate heartbeats
+We considered triggering an immediate heartbeat every time the Datanode detects a volume is full while handling a write request, with per-volume throttling. To each immediate heartbeat, we would attach the storage reports of all volumes in the Datanode, plus a Close Container Action for the particular container being written to. While this would update the storage stats of a volume in the SCM faster, which the SCM could subsequently use to decide whether to include that Datanode in a new pipeline, the per-volume throttling has a drawback: it wouldn't let us send close container actions for other containers. We decided not to take this approach.
+
+## Implementation Plan
+1. HDDS-13045: Trigger the immediate heartbeat. Already merged - https://github.com/apache/ozone/pull/8590.
+2. HDDS-12151: Fail a write call if it exceeds the min free space boundary (not the focus of this doc, just mentioned here).


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
