This is an automated email from the ASF dual-hosted git repository.
eldenmoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 2033d9dc10e Add performance test for variant doc mode and sparse
sharding (#3507)
2033d9dc10e is described below
commit 2033d9dc10e786d5075c89f9529bced355b8565b
Author: lihangyu <[email protected]>
AuthorDate: Tue Mar 31 17:34:59 2026 +0800
Add performance test for variant doc mode and sparse sharding (#3507)
---
.../semi-structured/variant-workload-guide.md | 65 ++++++++++++++-------
.../semi-structured/variant-workload-guide.md | 63 +++++++++++++-------
.../semi-structured/variant-workload-guide.md | 30 ++++++++--
.../semi-structured/variant-workload-guide.md | 63 +++++++++++++-------
.../images/variant/variant-bench-query-time-3x.svg | 53 +++++++++++++++++
static/images/variant/variant-bench-query-time.svg | 65 +++++++++++++++++++++
.../images/variant/variant-doc-mode-readpaths.png | Bin 0 -> 415834 bytes
.../semi-structured/variant-workload-guide.md | 30 ++++++++--
.../semi-structured/variant-workload-guide.md | 65 ++++++++++++++-------
9 files changed, 336 insertions(+), 98 deletions(-)
diff --git
a/docs/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/docs/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index b342002a199..3dcf21b6eaa 100644
---
a/docs/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/docs/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -60,7 +60,7 @@ You have a wide-JSON problem when path count keeps growing
and starts to create
## Key Concepts
-Before reading the templates below, make sure these terms are clear. Each is
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+Before reading the storage modes below, make sure these terms are clear. Each
is explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
**Subcolumnization.** When data is written into a `VARIANT` column, Doris
automatically discovers JSON paths and extracts hot paths as independent
columnar subcolumns for efficient analytics.
@@ -80,30 +80,29 @@ As shown above, hot paths (such as `user_id`, `page`) stay
as independent column

-**DOC mode.** Delays Subcolumnization at write time and additionally stores
the original JSON as a map-format stored field. This gives fast ingest and
efficient whole-document return at the cost of extra storage. Subcolumnization
still happens later during compaction.
+**DOC mode.** Delays Subcolumnization at write time and additionally stores
the original JSON as a map-format stored field (the **doc map**). This gives
fast ingest and efficient whole-document return at the cost of extra storage.
Subcolumnization still happens later during compaction.

As illustrated above, during write the JSON is preserved as-is into a Doc
Store for fast ingest. Subcolumns are extracted later during compaction. At
read time, path-based queries (e.g. `SELECT v['user_id']`) read from
materialized subcolumns at full columnar speed, while whole-document queries
(`SELECT v`) read directly from the Doc Store without reconstructing from
subcolumns.
-**Storage Format V3.** Decouples column metadata from the segment footer.
Recommended for any `VARIANT` table, especially wide JSON, because it
eliminates the metadata bottleneck when thousands of subcolumns exist.
+DOC mode has three distinct read paths depending on whether the queried path
has been materialized:
-## Recommended Decision Path
+
-
+- **DOC Materialized**: The queried path has already been extracted into a
subcolumn (after compaction or when `variant_doc_materialization_min_rows` is
met). Reads at full columnar speed, same as default VARIANT.
+- **DOC Map**: The queried path has not been materialized yet. The query falls
back to scanning the entire doc map to find the value — significantly slower on
wide JSON.
+- **DOC Map (Sharded)**: Same fallback, but with
`variant_doc_hash_shard_count` the doc map is distributed across multiple
physical columns, enabling parallel scan and much faster recovery.
-## Typical starting points
+**Storage Format V3.** Decouples column metadata from the segment footer.
Recommended for any `VARIANT` table, especially wide JSON, because it
eliminates the metadata bottleneck when thousands of subcolumns exist.
-- Event or audit logs where queries mostly filter on `event_type`, `user_id`,
or `app_version`: start with default `VARIANT`, then move to sparse columns
only if path count keeps growing.
-- Advertising, telemetry, or profile payloads with thousands of optional
attributes but analytics on a small hot set: start with sparse columns.
-- Model responses, trace snapshots, or archived JSON documents that are often
returned as complete payloads: start with DOC mode.
-- Order, payment, or device payloads where a few paths such as `order_id`,
`status`, or `device_id` must stay typed and indexed: add Schema Template only
for those paths.
+## Recommended Decision Path
-For most workloads, the default configuration is already the right starting
point. Tune only when the access pattern is clearly unusual. Typical examples
include AI training feature payloads, connected-vehicle telemetry, and user-tag
systems that need unusually large-scale Subcolumnization together with many
path-level indexes.
+
-## Recommended templates
+## Storage Modes
-Use the table below to pick a starting point, then read the matching template.
+Use the table below to pick a starting point, then read the matching section.
| | Typical scenario | Recommended mode | Key configuration |
|---|---|---|---|
@@ -112,7 +111,7 @@ Use the table below to pick a starting point, then read the
matching template.
| **C** | Model output / trace / archives (ingest-first or whole-doc return) |
DOC mode + V3 | `variant_enable_doc_mode`,
`variant_doc_materialization_min_rows` |
| **D** | Orders / payments / devices (key paths need stable types) | Schema
Template + A or B | Define only key paths |
-### Template A: Default semi-structured analytics
+### Default Mode
This is the safest starting point for most new `VARIANT` workloads.
@@ -139,7 +138,7 @@ Watch for:
- Do not raise `variant_max_subcolumns_count` early unless path growth is
already causing pressure.
- If the JSON is not wide, enabling sparse columns or DOC mode adds complexity
without benefit.
-### Template B: Wide JSON with hot-path analytics
+### Sparse Mode
Choose sparse columns when the payload is wide, but most queries still focus
on a small set of hot paths.
@@ -170,9 +169,9 @@ Use it when the total key count is very large, but the
primary workload is still
Watch for:
- If hot-path analytics is the bottleneck, do not jump to DOC mode first.
-- `variant_max_subcolumns_count` defaults to `2048`, which is already the
right starting point for automatic Subcolumnization in most workloads. Do not
set it so large that effectively all paths go through Subcolumnization. If the
workload truly needs very large extracted-subcolumn scale, prefer [Template C
(DOC mode)](#doc-mode-template).
+- `variant_max_subcolumns_count` defaults to `2048`, which is already the
right starting point for automatic Subcolumnization in most workloads. Do not
set it so large that effectively all paths go through Subcolumnization. If the
workload truly needs very large extracted-subcolumn scale, prefer [DOC
Mode](#doc-mode-template).
-### Template C: Ingest-first or whole-document return {#doc-mode-template}
+### DOC Mode {#doc-mode-template}
Choose DOC mode when returning the whole JSON document or minimizing ingest
overhead matters more than optimizing path-based analytics.
@@ -180,10 +179,10 @@ Typical example: model responses, trace snapshots, or
archived JSON documents th
DOC mode helps when:
-- DOC mode is the more stable choice when Subcolumnization scale becomes
extremely large, for example around ten-thousand extracted columns. When paths
participating in Subcolumnization approach 10,000, hardware requirements rise
quickly, so DOC mode should be evaluated first.
-- Compared with default eager Subcolumnization, compaction memory can drop by
about two-thirds.
-- In sparse wide-column ingestion workloads, throughput can improve by about
5–10x.
-- If the column is very wide and queries often read the whole `VARIANT` value,
DOC mode can improve `SELECT variant_col` performance by orders of magnitude
compared with reconstructing the document from many subcolumns.
+- When Subcolumnization scale becomes extremely large (approaching 10,000
paths), hardware requirements rise quickly. DOC mode is the more stable choice
at this scale.
+- Compaction memory can drop by about two-thirds compared with default eager
Subcolumnization.
+- In sparse wide-column ingestion workloads, throughput can improve by about
5–10×.
+- When queries read the whole `VARIANT` value (`SELECT variant_col`), DOC mode
avoids reconstructing the document from thousands of subcolumns, delivering
orders-of-magnitude speedup.
**Getting started:**
@@ -214,7 +213,7 @@ Watch for:
- DOC mode is not the default answer for every wide-JSON workload. If hot-path
analytics dominates, sparse columns usually fit better.
- DOC mode and sparse columns are mutually exclusive. They cannot be enabled
at the same time.
-### Template D: Stable key paths and path-level governance
+### Schema Template
Choose Schema Template when a small number of paths need stable types, stable
behavior, or path-specific indexes.
@@ -245,6 +244,28 @@ Watch for:
- Do not turn the whole JSON schema into a static template. That defeats the
point of `VARIANT`.
- Schema Template should cover key paths only; the rest stays dynamic.
+## Performance
+
+The chart below compares single-path extraction time on a 10K-path wide-column
dataset (200K rows, extracting one key, 16 CPUs, median of 3 runs).
+
+
+
+| Mode | Query Time | Peak Memory |
+|---|---:|---:|
+| DOC Materialized | 76 ms | 1 MiB |
+| VARIANT Default | 76 ms | 1 MiB |
+| DOC Map (Sharded) | 148 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| DOC Map | 2,533 ms | 1 MiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING (raw JSON) | 6,104 ms | 48 GiB |
+
+Key takeaways:
+
+- **Materialized subcolumns win.** Both Default and DOC Materialized deliver
~76 ms — 80× faster than raw STRING, 12× faster than JSONB.
+- **DOC Map with sharding helps.** Sharding the doc map cuts query time from
2.5 s to 148 ms for un-materialized paths.
+- **JSONB and STRING are memory-heavy.** They consume 32–48 GiB peak memory
vs. 1 MiB for VARIANT modes.
+
## Best Practices
### Import Phase
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index bd89f3b4fdf..4c4f844629a 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -60,7 +60,7 @@
## 关键概念
-阅读下面的模板之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
+阅读下面的存储模式之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
**子列列式提取(Subcolumnization)。** 写入 `VARIANT` 列时,Doris 会自动发现 JSON
Path,并对热点路径执行子列列式提取,使其以独立子列的形式参与分析。
@@ -80,30 +80,29 @@

-**DOC mode。** 写入时延迟子列列式提取(Subcolumnization),并额外存储一份 map 格式的原始
JSON。这带来了快速导入和高效整条文档返回能力,代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。
+**DOC mode。** 写入时延迟子列列式提取(Subcolumnization),并额外存储一份 map 格式的原始 JSON(即 **doc
map**)。这带来了快速导入和高效整条文档返回能力,代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。

如上图所示,写入时 JSON 被原样保存到 Doc Store 以实现快速导入。子列在后续 Compaction 过程中提取。读取时,按路径查询(如
`SELECT v['user_id']`)从物化子列中以列式速度读取;而整条文档查询(`SELECT v`)则直接从 Doc Store
中读取,无需从大量子列重组文档。
-**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用,尤其是宽
JSON 场景,因为它消除了上千子列同时存在时的元数据瓶颈。
+DOC mode 的读取路径取决于被查询的路径是否已经物化:
-## 推荐决策路径
+
-
+- **DOC Materialized**:被查询的路径已经提取为 subcolumn(Compaction 后或
`variant_doc_materialization_min_rows` 条件满足后)。以列式速度读取,与默认 VARIANT 一样快。
+- **DOC Map**:被查询的路径尚未物化。查询回退到扫描整个 doc map 来查找值 —— 在宽 JSON 上显著变慢。
+- **DOC Map(分片)**:同样的回退路径,但通过 `variant_doc_hash_shard_count` 将 doc map
分散到多个物理列,实现并行扫描,大幅加速恢复速度。
-## 几个常见起点
+**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用,尤其是宽
JSON 场景,因为它消除了上千子列同时存在时的元数据瓶颈。
-- 事件日志或审计日志,查询主要围绕 `event_type`、`user_id`、`app_version` 这类常见路径:先用默认
`VARIANT`,只有在路径数持续增长时再考虑 Sparse。
-- 广告、遥测或用户画像载荷,属性很多,但稳定查询的只是少量热点字段:先从 Sparse 开始。
-- 模型输出、Trace 快照或文档归档,查询经常需要整条 JSON:先从 DOC mode 开始。
-- 订单、支付或设备载荷里,只有少数路径如 `order_id`、`status`、`device_id` 必须保持稳定类型并建立索引:只给这些路径加
Schema Template。
+## 推荐决策路径
-对大多数 workload 来说,默认配置已经是合适的起点。只有在访问模式比较特殊时,才需要按场景调优。典型例子包括 AI
训练特征载荷、车联网遥测、用户标签系统这类需要支撑大规模子列列式提取(Subcolumnization)和大量路径级索引的场景。
+
-## 推荐起步模板
+## 存储模式
-先用下表选一个起点,再看对应模板。
+先用下表选一个起点,再看对应章节。
| | 典型场景 | 推荐模式 | 关键配置 |
|---|---|---|---|
@@ -112,7 +111,7 @@
| **C** | 模型输出/Trace/归档(写入优先或整条返回) | DOC mode + V3 |
`variant_enable_doc_mode`、`variant_doc_materialization_min_rows` |
| **D** | 订单/支付/设备(关键路径需稳定类型) | Schema Template + A 或 B | 只定义关键路径 |
-### 模板 A:默认半结构化分析
+### 默认模式
这是大多数新 `VARIANT` 场景最稳妥的起点。
@@ -139,7 +138,7 @@ PROPERTIES (
- 不要在没有证据的情况下,一开始就把 `variant_max_subcolumns_count` 调得很大。
- 如果 JSON 并不宽,开启 Sparse 或 DOC mode 只会增加复杂度而没有收益。
-### 模板 B:宽 JSON + 热点路径分析
+### Sparse 模式
当 JSON 很宽,但查询依然集中在少量热点路径上时,优先 Sparse。
@@ -170,9 +169,9 @@ PROPERTIES (
注意:
- 如果瓶颈还是热点路径分析,就不要先跳到 DOC mode。
-- `variant_max_subcolumns_count` 默认就是 `2048`,已经覆盖大多数 workload
的自动子列列式提取需求。不要把它设得过大,导致事实上全部路径都被列化;如果场景确实需要更大规模的子列列式提取,优先参考 [模板 C(DOC
mode)](#doc-mode-template)。
+- `variant_max_subcolumns_count` 默认就是 `2048`,已经覆盖大多数 workload
的自动子列列式提取需求。不要把它设得过大,导致事实上全部路径都被列化;如果场景确实需要更大规模的子列列式提取,优先参考 [DOC
模式](#doc-mode-template)。
-### 模板 C:导入优先或整条文档返回 {#doc-mode-template}
+### DOC 模式 {#doc-mode-template}
当整条 JSON 返回能力或写入效率比路径分析更重要时,优先 DOC mode。
@@ -180,10 +179,10 @@ PROPERTIES (
**DOC mode 更适合下面几类诉求:**
-- 更稳定地支撑超大规模子列列式提取(Subcolumnization),例如万列级别。若参与子列列式提取(Subcolumnization)的路径接近
10000,对硬件要求会明显提高,因此应优先评估 DOC mode。
+- 当子列列式提取规模接近万列级别时,硬件要求会明显提高。DOC mode 在这个规模下更稳定。
- 相比默认的即时 Subcolumnization,compaction 内存可下降约 2/3。
-- 在稀疏宽列导入场景下,导入性能可提升约 5~10 倍。
-- 如果列非常宽,且查询经常直接读取整个 `VARIANT` 值,DOC mode 相比从大量子列重组文档,`SELECT variant_col`
的效率可获得数量级提升。
+- 在稀疏宽列导入场景下,导入性能可提升约 5~10 倍。
+- 当查询直接读取整个 `VARIANT` 值(`SELECT variant_col`)时,DOC mode 无需从数千子列重组文档,效率可获得数量级提升。
**开始使用:**
@@ -214,7 +213,7 @@ PROPERTIES (
- DOC mode 不是所有宽 JSON 场景的默认答案。若核心诉求是热点路径分析,通常还是 Sparse 更合适。
- DOC mode 和 Sparse 互斥,不能同时开启。
-### 模板 D:关键路径稳定治理
+### Schema Template 模式
当只有少量路径需要稳定类型、稳定行为或路径级索引时,优先 Schema Template。
@@ -245,6 +244,28 @@ PROPERTIES (
- 不要试图把整个 JSON 都静态模板化,这会削弱 `VARIANT` 的意义。
- Schema Template 只用于关键路径,其余保持动态。
+## 性能
+
+下图对比了 10K 路径宽列数据集上的单路径提取耗时(200K 行,提取 key5000,16 CPU,3 次取中位数)。
+
+
+
+| 模式 | 查询耗时 | 峰值内存 |
+|---|---:|---:|
+| DOC Materialized | 76 ms | 1 MiB |
+| VARIANT 默认 | 76 ms | 1 MiB |
+| DOC Map(分片) | 148 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| DOC Map | 2,533 ms | 1 MiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING (原始 JSON) | 6,104 ms | 48 GiB |
+
+要点:
+
+- **物化子列最快。** 默认模式和 DOC Materialized 均约 76 ms —— 是原始 STRING 的 80 倍、JSONB 的 12
倍。
+- **DOC Map 分片有效。** 分片后 doc map 查询从 2.5 s 降至 148 ms。
+- **JSONB 和 STRING 内存开销大。** 峰值内存 32–48 GiB,而 VARIANT 各模式仅 1 MiB。
+
## 最佳实践
### 导入阶段
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index 04d0b75e6a4..75ec43472b9 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -43,7 +43,7 @@
## 关键概念
-阅读下面的模板之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
+阅读下面的存储模式之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
**子列列式提取(Subcolumnization)。** 写入 `VARIANT` 列时,Doris 会自动发现 JSON
Path,并对热点路径执行子列列式提取,使其以独立子列的形式参与分析。
@@ -67,9 +67,9 @@
对大多数 workload 来说,默认配置已经是合适的起点。只有在访问模式比较特殊时,才需要按场景调优。典型例子包括 AI
训练特征载荷、车联网遥测、用户标签系统这类在 Doris 3.1
及以上版本中需要支撑大规模子列列式提取(Subcolumnization)和大量路径级索引的场景。
-## 推荐起步模板
+## 存储模式
-先用下表选一个起点,再看对应模板。
+先用下表选一个起点,再看对应章节。
| | 典型场景 | 推荐模式 | 关键配置 |
|---|---|---|---|
@@ -77,7 +77,7 @@
| **B** | 广告/遥测/用户画像(宽、热点少) | Sparse(3.1+) | `variant_max_subcolumns_count` |
| **C** | 订单/支付/设备(关键路径需稳定类型) | Schema Template(3.1+) + A 或 B | 只定义关键路径 |
-### 模板 A:默认半结构化分析
+### 默认模式
这是大多数新 `VARIANT` 场景最稳妥的起点。
@@ -103,7 +103,7 @@ PROPERTIES (
- 不要在没有证据的情况下,一开始就把 `variant_max_subcolumns_count` 调得很大。
- 如果 JSON 并不宽,开启 Sparse 只会增加复杂度而没有收益。
-### 模板 B:宽 JSON + 热点路径分析
+### Sparse 模式
> 此模板需要 Doris 3.1.0 及以上版本。
@@ -137,7 +137,7 @@ PROPERTIES (
- 如果瓶颈还是热点路径分析,Sparse 是 3.x 中的正确方向。
- 不要把 `variant_max_subcolumns_count` 设得过大,导致事实上全部路径都被列化。这会增加元数据和 Compaction 开销。
-### 模板 C:关键路径稳定类型与治理
+### Schema Template 模式
> 此模板需要 Doris 3.1.0 及以上版本。
@@ -169,6 +169,24 @@ PROPERTIES (
- 不要试图把整个 JSON schema 都固化成静态模板,这会抵消 `VARIANT` 的价值。
- Schema Template 只用于关键路径,其余保持动态。
+## 性能
+
+下图对比了 10K 路径宽列数据集上的单路径提取耗时(200K 行,提取 key5000,16 CPU,3 次取中位数)。
+
+
+
+| 模式 | 查询耗时 | 峰值内存 |
+|---|---:|---:|
+| VARIANT 默认 | 76 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING(原始 JSON) | 6,104 ms | 48 GiB |
+
+要点:
+
+- **VARIANT 默认最快。** 76 ms —— 是 JSONB 的 12 倍、原始 STRING 的 80 倍。
+- **JSONB 和 STRING 内存开销大。** 峰值内存 32–48 GiB,而 VARIANT 仅 1 MiB。
+
## 最佳实践
### 导入阶段
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index bd89f3b4fdf..4c4f844629a 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -60,7 +60,7 @@
## 关键概念
-阅读下面的模板之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
+阅读下面的存储模式之前,先确认以下术语清晰。每个概念用 2-3 行讲清边界;实现细节请参考 [VARIANT](./VARIANT)。
**子列列式提取(Subcolumnization)。** 写入 `VARIANT` 列时,Doris 会自动发现 JSON
Path,并对热点路径执行子列列式提取,使其以独立子列的形式参与分析。
@@ -80,30 +80,29 @@

-**DOC mode。** 写入时延迟子列列式提取(Subcolumnization),并额外存储一份 map 格式的原始
JSON。这带来了快速导入和高效整条文档返回能力,代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。
+**DOC mode。** 写入时延迟子列列式提取(Subcolumnization),并额外存储一份 map 格式的原始 JSON(即 **doc
map**)。这带来了快速导入和高效整条文档返回能力,代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。

如上图所示,写入时 JSON 被原样保存到 Doc Store 以实现快速导入。子列在后续 Compaction 过程中提取。读取时,按路径查询(如
`SELECT v['user_id']`)从物化子列中以列式速度读取;而整条文档查询(`SELECT v`)则直接从 Doc Store
中读取,无需从大量子列重组文档。
-**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用,尤其是宽
JSON 场景,因为它消除了上千子列同时存在时的元数据瓶颈。
+DOC mode 的读取路径取决于被查询的路径是否已经物化:
-## 推荐决策路径
+
-
+- **DOC Materialized**:被查询的路径已经提取为 subcolumn(Compaction 后或
`variant_doc_materialization_min_rows` 条件满足后)。以列式速度读取,与默认 VARIANT 一样快。
+- **DOC Map**:被查询的路径尚未物化。查询回退到扫描整个 doc map 来查找值 —— 在宽 JSON 上显著变慢。
+- **DOC Map(分片)**:同样的回退路径,但通过 `variant_doc_hash_shard_count` 将 doc map
分散到多个物理列,实现并行扫描,大幅加速恢复速度。
-## 几个常见起点
+**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用,尤其是宽
JSON 场景,因为它消除了上千子列同时存在时的元数据瓶颈。
-- 事件日志或审计日志,查询主要围绕 `event_type`、`user_id`、`app_version` 这类常见路径:先用默认
`VARIANT`,只有在路径数持续增长时再考虑 Sparse。
-- 广告、遥测或用户画像载荷,属性很多,但稳定查询的只是少量热点字段:先从 Sparse 开始。
-- 模型输出、Trace 快照或文档归档,查询经常需要整条 JSON:先从 DOC mode 开始。
-- 订单、支付或设备载荷里,只有少数路径如 `order_id`、`status`、`device_id` 必须保持稳定类型并建立索引:只给这些路径加
Schema Template。
+## 推荐决策路径
-对大多数 workload 来说,默认配置已经是合适的起点。只有在访问模式比较特殊时,才需要按场景调优。典型例子包括 AI
训练特征载荷、车联网遥测、用户标签系统这类需要支撑大规模子列列式提取(Subcolumnization)和大量路径级索引的场景。
+
-## 推荐起步模板
+## 存储模式
-先用下表选一个起点,再看对应模板。
+先用下表选一个起点,再看对应章节。
| | 典型场景 | 推荐模式 | 关键配置 |
|---|---|---|---|
@@ -112,7 +111,7 @@
| **C** | 模型输出/Trace/归档(写入优先或整条返回) | DOC mode + V3 |
`variant_enable_doc_mode`、`variant_doc_materialization_min_rows` |
| **D** | 订单/支付/设备(关键路径需稳定类型) | Schema Template + A 或 B | 只定义关键路径 |
-### 模板 A:默认半结构化分析
+### 默认模式
这是大多数新 `VARIANT` 场景最稳妥的起点。
@@ -139,7 +138,7 @@ PROPERTIES (
- 不要在没有证据的情况下,一开始就把 `variant_max_subcolumns_count` 调得很大。
- 如果 JSON 并不宽,开启 Sparse 或 DOC mode 只会增加复杂度而没有收益。
-### 模板 B:宽 JSON + 热点路径分析
+### Sparse 模式
当 JSON 很宽,但查询依然集中在少量热点路径上时,优先 Sparse。
@@ -170,9 +169,9 @@ PROPERTIES (
注意:
- 如果瓶颈还是热点路径分析,就不要先跳到 DOC mode。
-- `variant_max_subcolumns_count` 默认就是 `2048`,已经覆盖大多数 workload
的自动子列列式提取需求。不要把它设得过大,导致事实上全部路径都被列化;如果场景确实需要更大规模的子列列式提取,优先参考 [模板 C(DOC
mode)](#doc-mode-template)。
+- `variant_max_subcolumns_count` 默认就是 `2048`,已经覆盖大多数 workload
的自动子列列式提取需求。不要把它设得过大,导致事实上全部路径都被列化;如果场景确实需要更大规模的子列列式提取,优先参考 [DOC
模式](#doc-mode-template)。
-### 模板 C:导入优先或整条文档返回 {#doc-mode-template}
+### DOC 模式 {#doc-mode-template}
当整条 JSON 返回能力或写入效率比路径分析更重要时,优先 DOC mode。
@@ -180,10 +179,10 @@ PROPERTIES (
**DOC mode 更适合下面几类诉求:**
-- 更稳定地支撑超大规模子列列式提取(Subcolumnization),例如万列级别。若参与子列列式提取(Subcolumnization)的路径接近
10000,对硬件要求会明显提高,因此应优先评估 DOC mode。
+- 当子列列式提取规模接近万列级别时,硬件要求会明显提高。DOC mode 在这个规模下更稳定。
- 相比默认的即时 Subcolumnization,compaction 内存可下降约 2/3。
-- 在稀疏宽列导入场景下,导入性能可提升约 5~10 倍。
-- 如果列非常宽,且查询经常直接读取整个 `VARIANT` 值,DOC mode 相比从大量子列重组文档,`SELECT variant_col`
的效率可获得数量级提升。
+- 在稀疏宽列导入场景下,导入性能可提升约 5~10 倍。
+- 当查询直接读取整个 `VARIANT` 值(`SELECT variant_col`)时,DOC mode 无需从数千子列重组文档,效率可获得数量级提升。
**开始使用:**
@@ -214,7 +213,7 @@ PROPERTIES (
- DOC mode 不是所有宽 JSON 场景的默认答案。若核心诉求是热点路径分析,通常还是 Sparse 更合适。
- DOC mode 和 Sparse 互斥,不能同时开启。
-### 模板 D:关键路径稳定治理
+### Schema Template 模式
当只有少量路径需要稳定类型、稳定行为或路径级索引时,优先 Schema Template。
@@ -245,6 +244,28 @@ PROPERTIES (
- 不要试图把整个 JSON 都静态模板化,这会削弱 `VARIANT` 的意义。
- Schema Template 只用于关键路径,其余保持动态。
+## 性能
+
+下图对比了 10K 路径宽列数据集上的单路径提取耗时(200K 行,提取 key5000,16 CPU,3 次取中位数)。
+
+
+
+| 模式 | 查询耗时 | 峰值内存 |
+|---|---:|---:|
+| DOC Materialized | 76 ms | 1 MiB |
+| VARIANT 默认 | 76 ms | 1 MiB |
+| DOC Map(分片) | 148 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| DOC Map | 2,533 ms | 1 MiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING (原始 JSON) | 6,104 ms | 48 GiB |
+
+要点:
+
+- **物化子列最快。** 默认模式和 DOC Materialized 均约 76 ms —— 是原始 STRING 的 80 倍、JSONB 的 12
倍。
+- **DOC Map 分片有效。** 分片后 doc map 查询从 2.5 s 降至 148 ms。
+- **JSONB 和 STRING 内存开销大。** 峰值内存 32–48 GiB,而 VARIANT 各模式仅 1 MiB。
+
## 最佳实践
### 导入阶段
diff --git a/static/images/variant/variant-bench-query-time-3x.svg
b/static/images/variant/variant-bench-query-time-3x.svg
new file mode 100644
index 00000000000..c7d71115f6f
--- /dev/null
+++ b/static/images/variant/variant-bench-query-time-3x.svg
@@ -0,0 +1,53 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="880" height="290" viewBox="0 0
880 290">
+ <style>
+ text { font-family: Inter, 'Segoe UI', system-ui, -apple-system,
sans-serif; }
+ .title { font-size: 17px; font-weight: 700; fill: #1e293b; }
+ .subtitle { font-size: 11px; fill: #94a3b8; }
+ .label { font-size: 13px; font-weight: 600; fill: #334155; }
+ .value { font-size: 13px; font-weight: 700; }
+ .grid-label { font-size: 10px; fill: #94a3b8; }
+ .footer { font-size: 10px; fill: #94a3b8; font-style: italic; }
+ .grid-line { stroke: #e2e8f0; stroke-width: 1; stroke-dasharray: 4,3; }
+ .axis-line { stroke: #cbd5e1; stroke-width: 1.5; }
+ </style>
+
+ <rect width="880" height="290" fill="#ffffff"/>
+
+ <text x="440" y="28" text-anchor="middle" class="title">Wide-Column
Single-Path Extraction: Query Time</text>
+ <text x="440" y="46" text-anchor="middle" class="subtitle">200K rows × 10K
paths · extracting key5000 · 16 CPUs · median of 3 runs</text>
+
+ <line x1="200" y1="62" x2="200" y2="258" class="grid-line"/>
+ <text x="200" y="58" text-anchor="middle" class="grid-label">0</text>
+ <line x1="298" y1="62" x2="298" y2="258" class="grid-line"/>
+ <text x="298" y="58" text-anchor="middle" class="grid-label">1 s</text>
+ <line x1="397" y1="62" x2="397" y2="258" class="grid-line"/>
+ <text x="397" y="58" text-anchor="middle" class="grid-label">2 s</text>
+ <line x1="495" y1="62" x2="495" y2="258" class="grid-line"/>
+ <text x="495" y="58" text-anchor="middle" class="grid-label">3 s</text>
+ <line x1="594" y1="62" x2="594" y2="258" class="grid-line"/>
+ <text x="594" y="58" text-anchor="middle" class="grid-label">4 s</text>
+ <line x1="692" y1="62" x2="692" y2="258" class="grid-line"/>
+ <text x="692" y="58" text-anchor="middle" class="grid-label">5 s</text>
+ <line x1="790" y1="62" x2="790" y2="258" class="grid-line"/>
+ <text x="790" y="58" text-anchor="middle" class="grid-label">6 s</text>
+
+ <line x1="200" y1="258" x2="810" y2="258" class="axis-line"/>
+
+ <text x="195" y="93" text-anchor="end" class="label">VARIANT Default</text>
+ <rect x="200" y="76" width="8" height="36" rx="5" fill="#10B981"/>
+ <text x="214" y="99" class="value" fill="#10B981">76 ms</text>
+
+ <text x="195" y="142" text-anchor="end" class="label">JSONB</text>
+ <rect x="200" y="125" width="87" height="36" rx="5" fill="#EAB308"/>
+ <text x="293" y="148" class="value" fill="#a16207">887 ms</text>
+
+ <text x="195" y="191" text-anchor="end"
class="label">MAP<STRING,STRING></text>
+ <rect x="200" y="174" width="275" height="36" rx="5" fill="#EF4444"/>
+ <text x="481" y="197" class="value" fill="#dc2626">2,800 ms</text>
+
+ <text x="195" y="240" text-anchor="end" class="label">STRING (raw
JSON)</text>
+ <rect x="200" y="223" width="601" height="36" rx="5" fill="#B91C1C"/>
+ <text x="807" y="246" text-anchor="start" class="value" fill="#991b1b">6,104
ms</text>
+
+ <text x="440" y="283" text-anchor="middle" class="footer">Doris VARIANT
wide-column benchmark · 2026-03</text>
+</svg>
diff --git a/static/images/variant/variant-bench-query-time.svg
b/static/images/variant/variant-bench-query-time.svg
new file mode 100644
index 00000000000..9f800f2599e
--- /dev/null
+++ b/static/images/variant/variant-bench-query-time.svg
@@ -0,0 +1,65 @@
+<svg xmlns="http://www.w3.org/2000/svg" width="880" height="435" viewBox="0 0
880 435">
+ <style>
+ text { font-family: Inter, 'Segoe UI', system-ui, -apple-system,
sans-serif; }
+ .title { font-size: 17px; font-weight: 700; fill: #1e293b; }
+ .subtitle { font-size: 11px; fill: #94a3b8; }
+ .label { font-size: 13px; font-weight: 600; fill: #334155; }
+ .value { font-size: 13px; font-weight: 700; }
+ .grid-label { font-size: 10px; fill: #94a3b8; }
+ .footer { font-size: 10px; fill: #94a3b8; font-style: italic; }
+ .grid-line { stroke: #e2e8f0; stroke-width: 1; stroke-dasharray: 4,3; }
+ .axis-line { stroke: #cbd5e1; stroke-width: 1.5; }
+ </style>
+
+ <rect width="880" height="435" fill="#ffffff"/>
+
+ <text x="440" y="28" text-anchor="middle" class="title">Wide-Column
Single-Path Extraction: Query Time</text>
+ <text x="440" y="46" text-anchor="middle" class="subtitle">200K rows × 10K
paths · extracting key5000 · 16 CPUs · median of 3 runs</text>
+
+ <line x1="200" y1="62" x2="200" y2="402" class="grid-line"/>
+ <text x="200" y="58" text-anchor="middle" class="grid-label">0</text>
+ <line x1="298" y1="62" x2="298" y2="402" class="grid-line"/>
+ <text x="298" y="58" text-anchor="middle" class="grid-label">1 s</text>
+ <line x1="397" y1="62" x2="397" y2="402" class="grid-line"/>
+ <text x="397" y="58" text-anchor="middle" class="grid-label">2 s</text>
+ <line x1="495" y1="62" x2="495" y2="402" class="grid-line"/>
+ <text x="495" y="58" text-anchor="middle" class="grid-label">3 s</text>
+ <line x1="594" y1="62" x2="594" y2="402" class="grid-line"/>
+ <text x="594" y="58" text-anchor="middle" class="grid-label">4 s</text>
+ <line x1="692" y1="62" x2="692" y2="402" class="grid-line"/>
+ <text x="692" y="58" text-anchor="middle" class="grid-label">5 s</text>
+ <line x1="790" y1="62" x2="790" y2="402" class="grid-line"/>
+ <text x="790" y="58" text-anchor="middle" class="grid-label">6 s</text>
+
+ <line x1="200" y1="402" x2="810" y2="402" class="axis-line"/>
+
+ <text x="195" y="93" text-anchor="end" class="label">DOC Materialized</text>
+ <rect x="200" y="76" width="8" height="36" rx="5" fill="#059669"/>
+ <text x="214" y="99" class="value" fill="#059669">76 ms</text>
+
+ <text x="195" y="142" text-anchor="end" class="label">VARIANT Default</text>
+ <rect x="200" y="125" width="8" height="36" rx="5" fill="#10B981"/>
+ <text x="214" y="148" class="value" fill="#10B981">76 ms</text>
+
+ <text x="195" y="191" text-anchor="end" class="label">DOC Map
(Sharded)</text>
+ <rect x="200" y="174" width="15" height="36" rx="5" fill="#3B82F6"/>
+ <text x="221" y="197" class="value" fill="#2563eb">148 ms</text>
+
+ <text x="195" y="240" text-anchor="end" class="label">JSONB</text>
+ <rect x="200" y="223" width="87" height="36" rx="5" fill="#EAB308"/>
+ <text x="293" y="246" class="value" fill="#a16207">887 ms</text>
+
+ <text x="195" y="289" text-anchor="end" class="label">DOC Map</text>
+ <rect x="200" y="272" width="249" height="36" rx="5" fill="#F97316"/>
+ <text x="455" y="295" class="value" fill="#c2410c">2,533 ms</text>
+
+ <text x="195" y="338" text-anchor="end"
class="label">MAP<STRING,STRING></text>
+ <rect x="200" y="321" width="275" height="36" rx="5" fill="#EF4444"/>
+ <text x="481" y="344" class="value" fill="#dc2626">2,800 ms</text>
+
+ <text x="195" y="387" text-anchor="end" class="label">STRING (raw
JSON)</text>
+ <rect x="200" y="370" width="601" height="36" rx="5" fill="#B91C1C"/>
+ <text x="807" y="393" text-anchor="start" class="value" fill="#991b1b">6,104
ms</text>
+
+ <text x="440" y="428" text-anchor="middle" class="footer">Doris VARIANT
wide-column benchmark · 2026-03</text>
+</svg>
diff --git a/static/images/variant/variant-doc-mode-readpaths.png
b/static/images/variant/variant-doc-mode-readpaths.png
new file mode 100644
index 00000000000..7644a3bdddf
Binary files /dev/null and
b/static/images/variant/variant-doc-mode-readpaths.png differ
diff --git
a/versioned_docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/versioned_docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index c8e1167fd4d..1563fc80178 100644
---
a/versioned_docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/versioned_docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -43,7 +43,7 @@ Prefer static columns when these conditions dominate:
## Key Concepts
-Before reading the templates below, make sure these terms are clear. Each is
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+Before reading the storage modes below, make sure these terms are clear. Each
is explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
**Subcolumnization.** When data is written into a `VARIANT` column, Doris
automatically discovers JSON paths and extracts hot paths as independent
columnar subcolumns for efficient analytics.
@@ -67,9 +67,9 @@ For wide JSON where most queries return the whole document,
Doris 3.x `VARIANT`
For most workloads, the default configuration is already the right starting
point. Tune only when the access pattern is clearly unusual. Typical examples
include AI training feature payloads, connected-vehicle telemetry, and user-tag
systems that, in Doris 3.1 and later, need unusually large-scale
Subcolumnization together with many path-level indexes.
-## Recommended templates
+## Storage Modes
-Use the table below to pick a starting point, then read the matching template.
+Use the table below to pick a starting point, then read the matching section.
| | Typical scenario | Recommended mode | Key configuration |
|---|---|---|---|
@@ -77,7 +77,7 @@ Use the table below to pick a starting point, then read the
matching template.
| **B** | Advertising / telemetry / user profiles (wide, hot paths few) |
Sparse (3.1+) | `variant_max_subcolumns_count` |
| **C** | Orders / payments / devices (key paths need stable types) | Schema
Template (3.1+) + A or B | Define only key paths |
-### Template A: Default semi-structured analytics
+### Default Mode
This is the safest starting point for most new `VARIANT` workloads.
@@ -103,7 +103,7 @@ Watch for:
- Do not raise `variant_max_subcolumns_count` early unless path growth is
already causing pressure.
- If the JSON is not wide, enabling sparse columns adds complexity without
benefit.
-### Template B: Wide JSON with hot-path analytics
+### Sparse Mode
> This template requires Doris 3.1.0 or later.
@@ -137,7 +137,7 @@ Watch for:
- If hot-path analytics is the bottleneck, sparse columns are the right
direction in 3.x.
- Do not set `variant_max_subcolumns_count` so large that effectively all
paths go through Subcolumnization. That defeats the purpose and increases
metadata and compaction cost.
-### Template C: Stable key paths and path-level governance
+### Schema Template
> This template requires Doris 3.1.0 or later.
@@ -169,6 +169,24 @@ Watch for:
- Do not turn the whole JSON schema into a static template. That defeats the
point of `VARIANT`.
- Schema Template should cover key paths only; the rest stays dynamic.
+## Performance
+
+The chart below compares single-path extraction time on a 10K-path wide-column
dataset (200K rows, extracting one key, 16 CPUs, median of 3 runs).
+
+
+
+| Mode | Query Time | Peak Memory |
+|---|---:|---:|
+| VARIANT Default | 76 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING (raw JSON) | 6,104 ms | 48 GiB |
+
+Key takeaways:
+
+- **VARIANT Default is fastest.** 76 ms — 12× faster than JSONB, 80× faster
than raw STRING.
+- **JSONB and STRING are memory-heavy.** They consume 32–48 GiB peak memory
vs. 1 MiB for VARIANT.
+
## Best Practices
### Import Phase
diff --git
a/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
b/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
index b342002a199..3dcf21b6eaa 100644
---
a/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
+++
b/versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md
@@ -60,7 +60,7 @@ You have a wide-JSON problem when path count keeps growing
and starts to create
## Key Concepts
-Before reading the templates below, make sure these terms are clear. Each is
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+Before reading the storage modes below, make sure these terms are clear. Each
is explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
**Subcolumnization.** When data is written into a `VARIANT` column, Doris
automatically discovers JSON paths and extracts hot paths as independent
columnar subcolumns for efficient analytics.
@@ -80,30 +80,29 @@ As shown above, hot paths (such as `user_id`, `page`) stay
as independent column

-**DOC mode.** Delays Subcolumnization at write time and additionally stores
the original JSON as a map-format stored field. This gives fast ingest and
efficient whole-document return at the cost of extra storage. Subcolumnization
still happens later during compaction.
+**DOC mode.** Delays Subcolumnization at write time and additionally stores
the original JSON as a map-format stored field (the **doc map**). This gives
fast ingest and efficient whole-document return at the cost of extra storage.
Subcolumnization still happens later during compaction.

As illustrated above, during write the JSON is preserved as-is into a Doc
Store for fast ingest. Subcolumns are extracted later during compaction. At
read time, path-based queries (e.g. `SELECT v['user_id']`) read from
materialized subcolumns at full columnar speed, while whole-document queries
(`SELECT v`) read directly from the Doc Store without reconstructing from
subcolumns.
-**Storage Format V3.** Decouples column metadata from the segment footer.
Recommended for any `VARIANT` table, especially wide JSON, because it
eliminates the metadata bottleneck when thousands of subcolumns exist.
+DOC mode has three distinct read paths depending on whether the queried path
has been materialized:
-## Recommended Decision Path
+
-
+- **DOC Materialized**: The queried path has already been extracted into a
subcolumn (after compaction or when `variant_doc_materialization_min_rows` is
met). Reads at full columnar speed, same as default VARIANT.
+- **DOC Map**: The queried path has not been materialized yet. The query falls
back to scanning the entire doc map to find the value — significantly slower on
wide JSON.
+- **DOC Map (Sharded)**: Same fallback, but with
`variant_doc_hash_shard_count` the doc map is distributed across multiple
physical columns, enabling parallel scan and much faster recovery.
-## Typical starting points
+**Storage Format V3.** Decouples column metadata from the segment footer.
Recommended for any `VARIANT` table, especially wide JSON, because it
eliminates the metadata bottleneck when thousands of subcolumns exist.
-- Event or audit logs where queries mostly filter on `event_type`, `user_id`,
or `app_version`: start with default `VARIANT`, then move to sparse columns
only if path count keeps growing.
-- Advertising, telemetry, or profile payloads with thousands of optional
attributes but analytics on a small hot set: start with sparse columns.
-- Model responses, trace snapshots, or archived JSON documents that are often
returned as complete payloads: start with DOC mode.
-- Order, payment, or device payloads where a few paths such as `order_id`,
`status`, or `device_id` must stay typed and indexed: add Schema Template only
for those paths.
+## Recommended Decision Path
-For most workloads, the default configuration is already the right starting
point. Tune only when the access pattern is clearly unusual. Typical examples
include AI training feature payloads, connected-vehicle telemetry, and user-tag
systems that need unusually large-scale Subcolumnization together with many
path-level indexes.
+
-## Recommended templates
+## Storage Modes
-Use the table below to pick a starting point, then read the matching template.
+Use the table below to pick a starting point, then read the matching section.
| | Typical scenario | Recommended mode | Key configuration |
|---|---|---|---|
@@ -112,7 +111,7 @@ Use the table below to pick a starting point, then read the
matching template.
| **C** | Model output / trace / archives (ingest-first or whole-doc return) |
DOC mode + V3 | `variant_enable_doc_mode`,
`variant_doc_materialization_min_rows` |
| **D** | Orders / payments / devices (key paths need stable types) | Schema
Template + A or B | Define only key paths |
-### Template A: Default semi-structured analytics
+### Default Mode
This is the safest starting point for most new `VARIANT` workloads.
@@ -139,7 +138,7 @@ Watch for:
- Do not raise `variant_max_subcolumns_count` early unless path growth is
already causing pressure.
- If the JSON is not wide, enabling sparse columns or DOC mode adds complexity
without benefit.
-### Template B: Wide JSON with hot-path analytics
+### Sparse Mode
Choose sparse columns when the payload is wide, but most queries still focus
on a small set of hot paths.
@@ -170,9 +169,9 @@ Use it when the total key count is very large, but the
primary workload is still
Watch for:
- If hot-path analytics is the bottleneck, do not jump to DOC mode first.
-- `variant_max_subcolumns_count` defaults to `2048`, which is already the
right starting point for automatic Subcolumnization in most workloads. Do not
set it so large that effectively all paths go through Subcolumnization. If the
workload truly needs very large extracted-subcolumn scale, prefer [Template C
(DOC mode)](#doc-mode-template).
+- `variant_max_subcolumns_count` defaults to `2048`, which is already the
right starting point for automatic Subcolumnization in most workloads. Do not
set it so large that effectively all paths go through Subcolumnization. If the
workload truly needs very large extracted-subcolumn scale, prefer [DOC
Mode](#doc-mode-template).
-### Template C: Ingest-first or whole-document return {#doc-mode-template}
+### DOC Mode {#doc-mode-template}
Choose DOC mode when returning the whole JSON document or minimizing ingest
overhead matters more than optimizing path-based analytics.
@@ -180,10 +179,10 @@ Typical example: model responses, trace snapshots, or
archived JSON documents th
DOC mode helps when:
-- DOC mode is the more stable choice when Subcolumnization scale becomes
extremely large, for example around ten-thousand extracted columns. When paths
participating in Subcolumnization approach 10,000, hardware requirements rise
quickly, so DOC mode should be evaluated first.
-- Compared with default eager Subcolumnization, compaction memory can drop by
about two-thirds.
-- In sparse wide-column ingestion workloads, throughput can improve by about
5–10x.
-- If the column is very wide and queries often read the whole `VARIANT` value,
DOC mode can improve `SELECT variant_col` performance by orders of magnitude
compared with reconstructing the document from many subcolumns.
+- When Subcolumnization scale becomes extremely large (approaching 10,000
paths), hardware requirements rise quickly. DOC mode is the more stable choice
at this scale.
+- Compaction memory can drop by about two-thirds compared with default eager
Subcolumnization.
+- In sparse wide-column ingestion workloads, throughput can improve by about
5–10×.
+- When queries read the whole `VARIANT` value (`SELECT variant_col`), DOC mode
avoids reconstructing the document from thousands of subcolumns, delivering
orders-of-magnitude speedup.
**Getting started:**
@@ -214,7 +213,7 @@ Watch for:
- DOC mode is not the default answer for every wide-JSON workload. If hot-path
analytics dominates, sparse columns usually fit better.
- DOC mode and sparse columns are mutually exclusive. They cannot be enabled
at the same time.
-### Template D: Stable key paths and path-level governance
+### Schema Template
Choose Schema Template when a small number of paths need stable types, stable
behavior, or path-specific indexes.
@@ -245,6 +244,28 @@ Watch for:
- Do not turn the whole JSON schema into a static template. That defeats the
point of `VARIANT`.
- Schema Template should cover key paths only; the rest stays dynamic.
+## Performance
+
+The chart below compares single-path extraction time on a 10K-path wide-column
dataset (200K rows, extracting one key, 16 CPUs, median of 3 runs).
+
+
+
+| Mode | Query Time | Peak Memory |
+|---|---:|---:|
+| DOC Materialized | 76 ms | 1 MiB |
+| VARIANT Default | 76 ms | 1 MiB |
+| DOC Map (Sharded) | 148 ms | 1 MiB |
+| JSONB | 887 ms | 32 GiB |
+| DOC Map | 2,533 ms | 1 MiB |
+| MAP\<STRING,STRING\> | 2,800 ms | 1 MiB |
+| STRING (raw JSON) | 6,104 ms | 48 GiB |
+
+Key takeaways:
+
+- **Materialized subcolumns win.** Both Default and DOC Materialized deliver
~76 ms — 80× faster than raw STRING, 12× faster than JSONB.
+- **DOC Map with sharding helps.** Sharding the doc map cuts query time from
2.5 s to 148 ms for un-materialized paths.
+- **JSONB and STRING are memory-heavy.** They consume 32–48 GiB peak memory
vs. 1 MiB for VARIANT modes.
+
## Best Practices
### Import Phase
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]