This is an automated email from the ASF dual-hosted git repository.

tejaskriya pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/ozone.git


The following commit(s) were added to refs/heads/master by this push:
     new f4337adef54 HDDS-13382. Add RocksDB documentation page (#8740)
f4337adef54 is described below

commit f4337adef548a148673a62c70076d93e6bdd2471
Author: Wei-Chiu Chuang <[email protected]>
AuthorDate: Tue Oct 14 22:47:12 2025 -0700

    HDDS-13382. Add RocksDB documentation page (#8740)
---
 hadoop-hdds/docs/content/concept/RocksDB.md | 161 ++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)

diff --git a/hadoop-hdds/docs/content/concept/RocksDB.md 
b/hadoop-hdds/docs/content/concept/RocksDB.md
new file mode 100644
index 00000000000..cd100b558d8
--- /dev/null
+++ b/hadoop-hdds/docs/content/concept/RocksDB.md
@@ -0,0 +1,161 @@
+---
+title: "RocksDB in Apache Ozone"
+menu:
+  main:
+    parent: Architecture
+---
+
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+> Note: This page covers advanced topics. Ozone administrators typically do 
not need to tinker with these settings.
+
+RocksDB is a critical component of Apache Ozone, providing a high-performance 
embedded key-value store. It is used by various Ozone services to persist 
metadata and state.
+
+## 1. Introduction to RocksDB
+
+RocksDB is a log-structured merge-tree (LSM-tree) based key-value store 
developed by Facebook. It is optimized for fast storage environments like SSDs 
and offers high write throughput and efficient point lookups. For more details, 
refer to the [RocksDB GitHub project](https://github.com/facebook/rocksdb) and 
the [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki).
+
+## 2. How Ozone uses RocksDB
+
+RocksDB is utilized in the following Ozone components to store critical 
metadata:
+
+*   **Ozone Manager (OM):** The OM uses RocksDB as its primary metadata store, 
holding the entire namespace and related information. As defined in 
`OMDBDefinition.java`, this includes tables for:
+    *   **Namespace:** `volumeTable`, `bucketTable`, `keyTable` (for object 
store layout), `directoryTable`, and `fileTable` (for file system layout).
+    *   **Security:** `userTable`, `dTokenTable` (delegation tokens), and 
`s3SecretTable`.
+    *   **State Management:** `transactionInfoTable` for tracking 
transactions, `deletedTable` for pending key deletions, and `snapshotInfoTable` 
for managing Ozone snapshots.
+
+*   **Storage Container Manager (SCM):** The SCM persists the state of the 
storage layer in RocksDB. The structure, defined in `SCMDBDefinition.java`, 
includes tables for:
+    *   `pipelines`: Manages the state and composition of data pipelines.
+    *   `containers`: Stores information about all storage containers in the 
cluster.
+    *   `deletedBlocks`: Tracks blocks that are marked for deletion and 
awaiting garbage collection.
+    *   `move`: Coordinates container movements for data rebalancing.
+    *   `validCerts`: Stores certificates for validating datanodes.
+    *   `validSCMCerts`: Stores certificates for validating SCMs.
+    *   `scmTransactionInfos`: Tracks SCM transactions.
+    *   `sequenceId`: Manages sequence IDs for various SCM operations.
+    *   `meta`: Stores miscellaneous SCM metadata, like upgrade status.
+    *   `statefulServiceConfig`: Stores configurations for stateful services.
+
+*   **Datanode:** A Datanode utilizes RocksDB for two main purposes:
+    1.  **Per-Volume Metadata:** It maintains one RocksDB instance per storage 
volume. Each of these instances manages metadata for the containers and blocks 
stored on that specific volume. As specified in 
`DatanodeSchemaThreeDBDefinition.java`, this database is structured with column 
families for `block_data`, `metadata`, `delete_txns`, `finalize_blocks`, and 
`last_chunk_info`. To optimize performance, it uses a fixed-length prefix based 
on the container ID, enabling efficient lookups w [...]
+    2.  **Global Container Tracking:** Additionally, each Datanode has a 
single, separate RocksDB instance to record the set of all containers it 
manages. This database, defined in `WitnessedContainerDBDefinition.java`, 
contains a `ContainerCreateInfoTable` table that provides a complete index of 
the containers hosted on that Datanode.
+
+*   **Recon:** Ozone's administration and monitoring tool, Recon, maintains 
its own RocksDB database to store aggregated and historical data for analysis. 
The `ReconDBDefinition.java` outlines tables for:
+    *   `containerKeyTable`: Maps containers to the keys they contain.
+    *   `namespaceSummaryTable`: Stores aggregated namespace information for 
quick reporting.
+    *   `replica_history`: Tracks the historical locations of container 
replicas, which is essential for auditing and diagnostics.
+    *   `keyContainerTable`: Maps keys to the containers they are in.
+    *   `containerKeyCountTable`: Stores the number of keys in each container.
+    *   `replica_history_v2`: Tracks the historical locations of container 
replicas with BCSID, which is essential for auditing and diagnostics.
+
+## 3. Tunings applicable to RocksDB
+
+Effective tuning of RocksDB can significantly impact Ozone's performance. 
Ozone exposes several configuration properties to tune RocksDB behavior. These 
properties are typically found in `ozone-default.xml` and can be overridden in 
`ozone-site.xml`.
+
+### General Settings
+
+Ozone provides a set of general RocksDB configurations that apply to all 
services (OM, SCM, and Datanodes) unless overridden by more specific settings. 
With the exception of `hdds.db.profile` and 
`ozone.metastore.rocksdb.cf.write.buffer.size`, these properties are defined in 
`RocksDBConfiguration.java`.
+
+*   `hdds.db.profile`: Specifies the RocksDB profile to use, which determines 
the default `DBOptions` and `ColumnFamilyOptions`. Default value: `DISK`.
+    *   Possible values include `SSD` and `DISK`.
+    *   For example, setting this to `SSD` will apply tunings optimized for 
SSD storage.
+
+*   **Write Options:**
+    *   `hadoop.hdds.db.rocksdb.writeoption.sync`: If set to `true`, writes 
are synchronized to persistent storage, ensuring durability at the cost of 
performance. If `false`, writes are flushed asynchronously. Default: `false`.
+
+*   `ozone.metastore.rocksdb.cf.write.buffer.size`: The write buffer 
(memtable) size for each column family of the rocksdb store. Default: `128MB`.
+
+*   **Write-Ahead Log (WAL) Management:**
+    *   `hadoop.hdds.db.rocksdb.WAL_ttl_seconds`: The time-to-live for WAL 
files in seconds. Default: `1200`.
+    *   `hadoop.hdds.db.rocksdb.WAL_size_limit_MB`: The total size limit for 
WAL files in megabytes. When this limit is exceeded, the oldest WAL files are 
deleted. A value of `0` means no limit. Default: `0`.
+
+*   **Logging:**
+    *   `hadoop.hdds.db.rocksdb.logging.enabled`: Enables or disables 
RocksDB's own logging. Default: `false`.
+    *   `hadoop.hdds.db.rocksdb.logging.level`: The logging level for RocksDB 
(INFO, DEBUG, WARN, ERROR, FATAL). Default: `INFO`.
+    *   `hadoop.hdds.db.rocksdb.max.log.file.size`: The maximum size of a 
single RocksDB log file. Default: `100MB`.
+    *   `hadoop.hdds.db.rocksdb.keep.log.file.num`: The maximum number of 
RocksDB log files to retain. Default: `10`.
+
+### Ozone Manager (OM) Specific Settings
+
+These settings, defined in `ozone-default.xml`, apply specifically to the 
Ozone Manager.
+
+*   `ozone.om.db.max.open.files`: The total number of files that a RocksDB can 
open in the OM. Default: `-1` (unlimited).
+*   `ozone.om.compaction.service.enabled`: Enable or disable a background job 
that periodically compacts rocksdb tables flagged for compaction. Default: 
`false`.
+*   `ozone.om.compaction.service.run.interval`: The interval for the OM's 
compaction service. Default: `6h`.
+*   `ozone.om.compaction.service.timeout`: Timeout for the OM's compaction 
service. Default: `10m`.
+*   `ozone.om.compaction.service.columnfamilies`: A comma-separated list of 
column families to be compacted by the service. Default: 
`keyTable,fileTable,directoryTable,deletedTable,deletedDirectoryTable,multipartInfoTable`.
+
+### DataNode-Specific Settings
+
+These settings, defined in `DatanodeConfiguration.java`, apply specifically to 
Datanodes and will override the general settings where applicable.
+
+Key tuning parameters for the DataNode often involve:
+
+*   **Memory usage:** Configuring block cache, write buffer manager, and other 
memory-related settings.
+    *   `hdds.datanode.metadata.rocksdb.cache.size`: Configures the block 
cache size for RocksDB instances on Datanodes. Default value: `1GB`.
+*   **Compaction strategies:** Optimizing how data is merged and organized on 
disk. For more details, refer to the [Merge Container RocksDB in DN 
Documentation]({{< ref "feature/dn-merge-rocksdb.md" >}}).
+    *   `hdds.datanode.rocksdb.auto-compaction-small-sst-file`: Enables or 
disables auto-compaction for small SST files. Default value: `true`.
+    *   `hdds.datanode.rocksdb.auto-compaction-small-sst-file-size-threshold`: 
Threshold for small SST file size for auto-compaction. Default value: `1MB`.
+    *   `hdds.datanode.rocksdb.auto-compaction-small-sst-file-num-threshold`: 
Threshold for the number of small SST files for auto-compaction. Default value: 
`512`.
+    *   
`hdds.datanode.rocksdb.auto-compaction-small-sst-file.interval.minutes`: Auto 
compact small SST files interval in minutes. Default value: `120`.
+    *   `hdds.datanode.rocksdb.auto-compaction-small-sst-file.threads`: Auto 
compact small SST files threads. Default value: `1`.
+*   **Write-ahead log (WAL) settings:** Balancing durability and write 
performance.
+    *   `hdds.datanode.rocksdb.log.max-file-size`: The max size of each user 
log file of RocksDB. O means no size limit. Default value: `32MB`.
+    *   `hdds.datanode.rocksdb.log.max-file-num`: The max user log file number 
to keep for each RocksDB. Default value: `64`.
+*   **Logging:**
+    *   `hdds.datanode.rocksdb.log.level`: The user log level of 
RocksDB(DEBUG/INFO/WARN/ERROR/FATAL)). Default: `INFO`.
+*   **Other Settings:**
+    *   `hdds.datanode.rocksdb.delete-obsolete-files-period`: Periodicity when 
obsolete files get deleted. Default is 1h.
+    *   `hdds.datanode.rocksdb.max-open-files`: The total number of files that 
a RocksDB can open. Default: `1024`.
+
+## 4. Troubleshooting and repair tools relevant to RocksDB
+
+Troubleshooting RocksDB issues in Ozone often involves:
+
+*   Analyzing RocksDB logs for errors and warnings.
+*   Using RocksDB's built-in tools for inspecting database files:
+    *   
[**ldb**](https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool#ldb-tool):
 A command-line tool for inspecting and manipulating the contents of a RocksDB 
database.
+    *   
[**sst_dump**](https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool#sst-dump-tool):
 A command-line tool for inspecting the contents of SST (Static Table) files, 
which are the files that store the data in RocksDB.
+*   Understanding common RocksDB error codes and their implications.
+
+## 5. Version Compatibility
+
+Apache Ozone uses RocksDB version 7.7.3. It is recommended to use RocksDB 
tools of this version to ensure compatibility and avoid any potential issues.
+
+## 6. Monitoring and Metrics
+
+Monitoring RocksDB performance is crucial for maintaining a healthy Ozone 
cluster.
+
+*   **RocksDB Statistics:** Ozone can expose detailed RocksDB statistics. 
Enable this by setting `ozone.metastore.rocksdb.statistics` to `ALL` or 
`EXCEPT_DETAILED_TIMERS` in `ozone-site.xml`. Be aware that enabling detailed 
statistics can incur a performance penalty (5-10%).
+*   **Grafana Dashboards:** Ozone provides Grafana dashboards that visualize 
low-level RocksDB statistics. Refer to the [Ozone Monitoring Documentation]({{< 
ref "feature/Observability.md" >}}) for details on setting up monitoring and 
using these dashboards.
+
+## 7. Storage Sizing
+
+Properly sizing the storage for RocksDB instances is essential to prevent 
performance bottlenecks and out-of-disk errors. The requirements vary 
significantly for each Ozone component, and using dedicated, fast storage 
(SSDs) is highly recommended.
+
+*   **Ozone Manager (OM):**
+  *   **Baseline:** A minimum of **100 GB** should be reserved for the OM's 
RocksDB instance. The OM stores the entire namespace metadata (volumes, 
buckets, keys), so this is the most critical database in the cluster.
+  *   **With Snapshots:** Enabling Ozone Snapshots will substantially increase 
storage needs. Each snapshot preserves a view of the metadata, and the 
underlying data files (SSTs) cannot be deleted by compaction until a snapshot 
is removed. The exact requirement depends on the number of retained snapshots 
and the rate of change (creations/deletions) in the namespace. Monitor disk 
usage closely after enabling snapshots. For more details, refer to the [Ozone 
Snapshot Documentation]({{< ref  [...]
+
+*   **Storage Container Manager (SCM):**
+  *   SCM's metadata footprint (pipelines, containers, Datanode heartbeats) is 
much smaller than the OM's. A baseline of **20-50 GB** is typically sufficient 
for its RocksDB instance.
+
+*   **Datanode:**
+  *   The Datanode's RocksDB stores metadata for all containers and their 
blocks. Its size grows proportionally with the number of containers and blocks 
hosted on that Datanode.
+  *   **Rule of Thumb:** A good starting point is to reserve **0.1% to 0.5%** 
of the total data disk capacity for RocksDB metadata. For example, a Datanode 
with 100 TB of data disks should reserve between 100 GB and 500 GB for its 
RocksDB metadata.
+  *   Workloads with many small files will result in a higher block count and 
will require space on the higher end of this range.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to