This is an automated email from the ASF dual-hosted git repository.
hanahmily pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
The following commit(s) were added to refs/heads/main by this push:
new 4f25a9de6 init (#1062)
4f25a9de6 is described below
commit 4f25a9de6aa606937e6060b2387fbc8bdf40e66e
Author: OmCheeLin <[email protected]>
AuthorDate: Sun Apr 12 07:38:18 2026 +0800
init (#1062)
---
docs/concept/clustering.md | 2 ++
docs/concept/distributed-measure-aggregation.md | 45 +++++++++++++++++++++++++
docs/menu.yml | 2 ++
3 files changed, 49 insertions(+)
diff --git a/docs/concept/clustering.md b/docs/concept/clustering.md
index 43f368ea5..8bfce2cac 100644
--- a/docs/concept/clustering.md
+++ b/docs/concept/clustering.md
@@ -29,6 +29,8 @@ Liaison Nodes serve as gateways, routing traffic to Data
Nodes. In addition to r
Liaison Nodes are also responsible for handling computational tasks associated
with distributed querying the database. They build query tasks and search for
data from Data Nodes.
+For measure queries that include aggregation, data nodes perform a local
**map** phase and the liaison performs a **reduce** phase so results stay
correct across shards while limiting raw data on the network. See [Distributed
Measure Aggregation](distributed-measure-aggregation.md).
+
### 1.4 Standalone Mode
BanyanDB integrates multiple roles into a single process in the standalone
mode, making it simpler and faster to deploy. This mode is especially useful
for scenarios with a limited number of data points or for testing and
development purposes.
diff --git a/docs/concept/distributed-measure-aggregation.md
b/docs/concept/distributed-measure-aggregation.md
new file mode 100644
index 000000000..f53b965c1
--- /dev/null
+++ b/docs/concept/distributed-measure-aggregation.md
@@ -0,0 +1,45 @@
+# Distributed Measure Aggregation
+
+In a cluster, a measure query may touch many shards spread across **Data
Nodes**. When the query includes an **aggregation** (`agg` in the API),
BanyanDB splits the work into two logical phases so results stay correct while
limiting how much raw data moves over the network.
+
+This page describes that design at a high level. For roles of liaisons and
data nodes in general, see [Clustering](clustering.md). For how to issue
measure queries from clients, see [Query
Measures](../interacting/bydbctl/query/measure.md).
+
+## Map on Data Nodes, Reduce on the Liaison
+
+**Map phase (Data Nodes)**
+Each data node scans its local shards, applies the same filters and time range
as the user query, and runs the aggregation over the matching **raw** field
values. Instead of sending every raw point upstream, the node sends a compact
**intermediate** representation of the aggregation for each logical result row
(for example one row per group when `group_by` is used).
+
+**Reduce phase (Liaison Node)**
+The liaison collects those intermediate values from all participating nodes,
**combines** them with the same aggregation function, and produces the final
value the client expects. Combining partial sums into a total sum, or taking
the max of partial maxima, are typical examples.
+
+So: data nodes do the heavy lifting near storage; the liaison merges
shard-level (and replica-deduplicated) partials into one answer.
+
+## Why Two Phases
+
+- **Correctness across shards**: A group or time bucket may span multiple
shards on different nodes. Each node can only aggregate what it sees locally;
the liaison merges those local summaries into a global result.
+- **Efficient `MEAN`**: The global mean is not the mean of local means. The
map phase tracks **sum and count** (conceptually); the reduce phase adds sums
and counts from all nodes, then divides. That requires an intermediate form
richer than a single number for the map output.
+- **Less data on the wire**: For aggregated queries, nodes send partial
aggregates instead of full raw series, which scales better as data volume grows.
+
+## Supported Aggregation Functions
+
+The same aggregation functions you use on a single node are supported in
distributed mode: **SUM**, **COUNT**, **MAX**, **MIN**, and **MEAN**. The map
and reduce steps are implemented so each function composes safely across shards
(for example COUNT uses count-like partials that are summed at the liaison,
analogous to SUM).
+
+## Replicas and Deduplication
+
+The same shard may be read from more than one replica for availability. Before
the reduce step, the liaison **deduplicates** map results that represent the
same shard (and the same group key when `group_by` is used), so replica
responses are not counted twice. Operators do not configure this; it is part of
query execution.
+
+## What This Means for Operations
+
+- **CPU**: Data nodes spend more CPU on aggregated queries because aggregation
runs where the data lives. The liaison mainly merges partials, which is
comparatively light for typical workloads.
+- **Network**: Aggregated queries generally move less payload than fetching
all raw points and aggregating only on the liaison.
+- **Scaling**: Adding data nodes spreads map work across the cluster for
shard-local portions of the query; the liaison still performs one reduce pass
over the combined partial set, so liaison capacity remains relevant for very
large fan-out.
+
+## Standalone Mode
+
+In standalone deployment, a single process plays both roles. The same
aggregation logic applies locally without cross-node traffic; there is no
separate “wire format” step from an operator’s perspective.
+
+## Related Material
+
+- [Clustering](clustering.md) — liaison vs data node responsibilities and
routing
+- [Query Measures](../interacting/bydbctl/query/measure.md) — measure query
examples
+- [API reference](../api-reference.md) — measure query and internal
`agg_return_partial` (used between liaison and data nodes for partial
aggregates)
\ No newline at end of file
diff --git a/docs/menu.yml b/docs/menu.yml
index 32d5c2cf9..caaf53e16 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -179,6 +179,8 @@ catalog:
catalog:
- name: "Clustering"
path: "/concept/clustering"
+ - name: "Distributed Measure Aggregation"
+ path: "/concept/distributed-measure-aggregation"
- name: "TSDB"
path: "/concept/tsdb"
- name: "Data Rotation"