This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 168e597670d optimization after performance (#3177)
168e597670d is described below
commit 168e597670d8998bf1a81371c1af717d24a28096
Author: zhiqiang <[email protected]>
AuthorDate: Thu Dec 11 12:20:57 2025 +0800
optimization after performance (#3177)
## Versions
- [x] dev
- [x] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
docs/ai/vector-search/behind-index.md | 240 ++++++++++++++++++++
docs/ai/vector-search/overview.md | 46 +---
.../current/ai/vector-search/behind-index.md | 251 +++++++++++++++++++++
.../current/ai/vector-search/overview.md | 44 +---
.../version-4.x/ai/vector-search/behind-index.md | 251 +++++++++++++++++++++
.../version-4.x/ai/vector-search/overview.md | 47 ++--
sidebars.ts | 1 +
static/images/vector-search/image-1.png | Bin 0 -> 61728 bytes
static/images/vector-search/image-2.png | Bin 0 -> 117065 bytes
static/images/vector-search/image-3.png | Bin 0 -> 134948 bytes
static/images/vector-search/image-4.png | Bin 0 -> 88273 bytes
static/images/vector-search/image.png | Bin 0 -> 750864 bytes
.../version-4.x/ai/vector-search/behind-index.md | 240 ++++++++++++++++++++
.../version-4.x/ai/vector-search/overview.md | 45 +---
versioned_sidebars/version-4.x-sidebars.json | 3 +-
15 files changed, 1033 insertions(+), 135 deletions(-)
diff --git a/docs/ai/vector-search/behind-index.md
b/docs/ai/vector-search/behind-index.md
new file mode 100644
index 00000000000..1dd1b0b3801
--- /dev/null
+++ b/docs/ai/vector-search/behind-index.md
@@ -0,0 +1,240 @@
+---
+{
+ "title": "Optimizations Behind Performance",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Early versions of Apache Doris focused on online analytical processing (OLAP),
primarily for reporting and aggregation workloads—typical queries being
multi-table JOIN and GROUP BY. In 2.x, Doris added text search via inverted
indexes and introduced the Variant type for efficient JSON handling. In 3.x,
storage-compute separation enabled leveraging object storage to significantly
reduce storage costs. In 4.x, Doris steps into the AI era by introducing vector indexes and hybrid search (vector search combined with text search), positioning Doris as a core AI data-analysis platform for enterprises. This article covers how Doris 4.x implements vector indexing and the work done to bring its performance to an industry-leading level.
+
+We divide vector indexing into two stages: indexing and querying. The indexing
stage focuses on 1) data sharding, 2) efficiently building high-quality
indexes, and 3) index management. The querying stage has a single goal: improve
query performance—eliminating redundant computation and unnecessary IO while
optimizing concurrency.
+
+## Indexing Stage
+Indexing performance is strongly tied to index hyperparameters: higher index
quality typically means longer build time. Thanks to optimizations in the
ingestion path, Doris can maintain high index quality while improving ingestion
throughput.
+
+On the 768D 10M dataset, Apache Doris achieves industry-leading ingestion
performance.
+
+
+
+### Multi-Level Sharding
+Internal tables in Apache Doris are inherently distributed. During query and
ingestion, users interact with a single logical table, while the Doris kernel
creates the required number of physical tablets based on the table definition.
During ingestion, data is routed to the appropriate BE tablet by partition and
bucket keys. Multiple tablets together form the logical table seen by users.
Each ingestion request forms a transaction, creating a rowset (the versioning unit) on the corresponding tablet. Each rowset contains one or more segments; segments are what actually store the data, and ANN indexes are built at segment granularity.
+
+
+
+Vector indexes (e.g., HNSW) rely on key hyperparameters that directly
determine index quality and query performance, and are typically tuned for
specific data scales. Apache Doris’s multi-level sharding decouples “index
parameters” from the “full table data scale”: users need not rebuild indexes as
total data grows, but only tune parameters based on per-batch ingestion size.
From our tests, HNSW suggested parameters under different batch sizes are:
+
+| batch_size | max_degree | ef_construction | ef_search | recall@100 |
+|------------|------------|-----------------|-----------|------------|
+| 250000 | 100 | 200 | 50 | 89% |
+| 250000 | 100 | 200 | 100 | 93% |
+| 250000 | 100 | 200 | 150 | 95% |
+| 250000 | 100 | 200 | 200 | 98% |
+| 500000 | 120 | 240 | 50 | 91% |
+| 500000 | 120 | 240 | 100 | 94% |
+| 500000 | 120 | 240 | 150 | 96% |
+| 500000 | 120 | 240 | 200 | 99% |
+| 1000000 | 150 | 300 | 50 | 90% |
+| 1000000 | 150 | 300 | 100 | 93% |
+| 1000000 | 150 | 300 | 150 | 96% |
+| 1000000 | 150 | 300 | 200 | 98% |
+
+In short, focus on “per-batch ingestion size” and choose proper index
parameters to maintain quality and stable query behavior.
+
+### High-Performance Index Building
+
+#### Parallel, High-Quality Index Construction
+
+Doris accelerates index builds with two-level parallelism: cluster-level
parallelism across BE nodes, and intra-node multithreaded distance computation
on grouped batch data. Beyond speed, Doris improves index quality via in-memory
batching: when the total vector count is fixed but batching is too fine
(frequent incremental builds), graph structures become sparser and recall
drops. For example, on the 768D 10M dataset, building in 10 batches may reach ~99% recall, while 100 batches may drop to ~95%. In-memory batching better balances memory footprint and graph quality under the same hyperparameters, avoiding the quality degradation caused by overly fine batching.
+
+#### SIMD
+
+The core cost in ANN index building is large-scale distance computation—a
CPU-bound workload. Doris centralizes this work on BE nodes, implemented in
C++, and leverages Faiss’s automatic and manual vectorization optimizations.
For L2 distance, Faiss uses compiler pragmas to trigger auto-vectorization:
+```cpp
+FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
+float fvec_L2sqr(const float* x, const float* y, size_t d) {
+ size_t i; float res = 0;
+ FAISS_PRAGMA_IMPRECISE_LOOP
+ for (i = 0; i < d; i++) {
+ const float tmp = x[i] - y[i];
+ res += tmp * tmp;
+ }
+ return res;
+}
+FAISS_PRAGMA_IMPRECISE_FUNCTION_END
+```
+With `FAISS_PRAGMA_IMPRECISE_*`, compilers auto-vectorize:
+```cpp
+#define FAISS_PRAGMA_IMPRECISE_LOOP \
+ _Pragma("clang loop vectorize(enable) interleave(enable)")
+```
+Faiss also applies explicit SIMD in `#ifdef SSE3/AVX2/AVX512F` blocks using
`_mm*`/`_mm256*`/`_mm512*`, combined with `ElementOpL2/ElementOpIP` and
dimension-specialized `fvec_op_ny_D{1,2,4,8,12}` to:
+- Process multiple samples per iteration (e.g., 8/16) and perform
register-level transpose to improve memory access locality;
+- Use FMA (e.g., `_mm512_fmadd_ps`) to fuse multiply-add and reduce
instruction count;
+- Do horizontal sums to produce scalars efficiently;
+- Handle tail elements via masked reads for non-aligned sizes.
+These optimizations reduce instruction and memory costs and significantly
boost indexing throughput.
+
+## Querying Stage
+
+Search is latency sensitive. At tens of millions of records with high
concurrency, P99 latency typically needs to be under 500 ms—raising the bar for
the optimizer, execution engine, and index implementation. Out-of-the-box tests
show Doris reaches performance comparable to mainstream dedicated vector
databases. The chart below compares Doris against other systems on
Performance768D10M; peer data comes from Zilliz’s open-source
[VectorDBBench](https://github.com/zilliztech/VectorDBBench).
+
+
+
+> Note: The chart includes a subset of out-of-the-box results. OpenSearch and
Elastic Cloud can improve query performance by optimizing the number of index
files.
+
+### Prepare Statement
+In the traditional path, Doris runs full optimization (parsing, semantic
analysis, RBO, CBO) for every SQL. While essential for general OLAP, this adds
overhead for simple, highly repetitive search queries. Doris 4.0 extends
Prepare Statement beyond point lookups to all SQL types, including vector
search:
+1. Separate compile and execute
+ - Prepare performs parsing, semantics, and optimization once, producing a
reusable Logical Plan.
+ - Execute binds parameters at runtime and runs the pre-built plan, skipping
the optimizer entirely.
+2. Plan cache
+ - Reuse is determined by SQL fingerprint (normalized SQL + schema version).
+ - Different parameter values with the same structure reuse the cached plan,
avoiding re-optimization.
+3. Schema version check
+ - Validate schema version at execution to ensure correctness.
+ - No change → reuse; changed → invalidate and re-prepare.
+4. Speedup by skipping optimizer
+ - Execute no longer runs RBO/CBO; optimizer time is nearly eliminated.
+ - Template-heavy vector queries benefit with significantly lower end-to-end
latency.
+
+### Index Only Scan
+Doris implements vector indexes as external (pluggable) indexes, which
simplifies management and supports asynchronous builds, but introduces
performance challenges such as avoiding redundant computation and IO. ANN
indexes can return distances in addition to row IDs. Doris leverages this by
short-circuiting distance expressions within the Scan operator via “virtual
columns,” and the Ann Index Only Scan fully eliminates distance-related read IO.
+In the naive flow, Scan pushes predicates to the index, the index returns row
IDs, and Scan then reads data pages and computes expressions before returning N
rows upstream.
+
+
+
+With Index Only Scan applied, the flow becomes:
+
+
+
+For example, `SELECT l2_distance_approximate(embedding, [...]) AS dist FROM
tbl ORDER BY dist LIMIT 100;` executes without touching data files.
+
+Beyond Ann TopN Search, Range Search and Compound Search adopt similar optimizations. Range Search is more nuanced: whether the index returns `dist` depends on the comparator. The list below covers the query types related to Ann Index Only Scan and whether Index Scan applies to each:
+
+```SQL
+-- Sql1: Range + proj
+-- Index returns dist; no need to recompute dist
+-- Virtual column for CSE avoids dist recomputation in proj
+-- IndexScan: True
+select id, dist(embedding, [...]) from tbl where dist <= 10;
+
+-- Sql2: Range + no-proj
+-- Index returns dist; no need to recompute
+-- IndexScan: True
+select id from tbl where dist <= 10 order by id limit N;
+
+-- Sql3: Range + proj + no-dist-from index
+-- Index cannot return dist (only updates rowid map)
+-- proj requires dist → embedding must be reread
+-- IndexScan: False
+select id, dist(embedding, [...]) from tbl where dist > 10;
+
+-- Sql4: Range + proj + no-dist-from index
+-- Index cannot return dist, but proj does not need dist → embedding not reread
+-- IndexScan: True
+select id from tbl where dist > 10;
+
+-- Sql5: TopN
+-- Index returns dist; virtual slot for CSE uploads dist to proj
+-- embedding column not read
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl order by dist(embedding, [...])
asc limit N;
+
+-- Sql6: TopN + IndexFilter
+-- 1) comment not read (inverted index already optimizes this)
+-- 2) embedding not read (same reason as Sql5)
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql7: TopN + Range
+-- IndexScan: True (combination of Sql1 and Sql5)
+select id[, dist(embedding, [...])] from tbl where dist(embedding, [...]) > 10
order by dist(embedding, [...]) limit N;
+
+-- Sql8: TopN + Range + IndexFilter
+-- IndexScan: True (combination of Sql7 and Sql6)
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql9: TopN + Range + CommonFilter
+-- Key points: 1) dist < 10 (not > 10); 2) common filter reads dist, not
embedding
+-- Index returns dist; virtual slot for CSE ensures all reads refer to the
same column
+-- In theory embedding need not materialize; in practice it still does due to
residual predicates on the column
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) < 10 AND abs(dist(embedding) + 10) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+
+-- Sql10: Variant of Sql9, dist < 10 → dist > 10
+-- Index cannot return dist; computing abs(dist(embedding) + 10) forces embedding to materialize
+-- IndexScan: False
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) > 10 AND abs(dist(embedding) + 10) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+
+-- Sql11: Variant of Sql9, abs(dist(...)+10) > 10 → array_size(embedding) > 10
+-- array_size requires embedding materialization
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) < 10 AND array_size(embedding) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+```
+
+### Virtual Columns for CSE
+
+Index Only Scan mainly eliminates IO (random reads of embedding). To further
remove redundant computation, Doris introduces virtual columns that pass
index-returned `dist` into the expression engine as a column.
+Design highlights:
+1. Expression node `VirtualSlotRef`
+2. Column iterator `VirtualColumnIterator`
+
+`VirtualSlotRef` is a compute-time-generated column: materialized by one
expression, reusable by many, computed once on first use—eliminating CSE across
Projection and predicates. `VirtualColumnIterator` materializes index-returned
distances into expressions, avoiding repeated distance calculations. Initially
built for ANN query CSE elimination, the mechanism was generalized to
Projection + Scan + Filter.
+Using the ClickBench dataset, the query below counts the top 20 websites by
Google clicks:
+```sql
+set experimental_enable_virtual_slot_for_cse=true;
+
+SELECT counterid,
+ COUNT(*) AS hit_count,
+ COUNT(DISTINCT userid) AS unique_users
+FROM hits
+WHERE ( UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) = 'GOOGLE.COM'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) =
'GOOGLE.RU'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) LIKE
'%GOOGLE%' )
+ AND ( LENGTH(regexp_extract(referer, '^https?://([^/]+)', 1)) > 3
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) != ''
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) IS NOT NULL )
+ AND eventdate = '2013-07-15'
+GROUP BY counterid
+HAVING hit_count > 100
+ORDER BY hit_count DESC
+LIMIT 20;
+```
+The core expression `regexp_extract(referer, '^https?://([^/]+)', 1)` is
CPU-intensive and reused across predicates. With virtual columns enabled (`set
experimental_enable_virtual_slot_for_cse=true;`):
+- Enabled: 0.57 s
+- Disabled: 1.50 s
+
+End-to-end performance improves by roughly 2.6x (1.50 s → 0.57 s).
+
+### Scan Parallelism Optimization
+Doris revamped Scan parallelism for Ann TopN Search. The original policy set parallelism by row count (default: 2,097,152 rows per Scan Task). Because segments are created by size, high-dimensional vector columns produce far fewer rows per segment, leading to multiple segments being scanned serially within one Scan Task. Doris switched to one Scan Task per segment, boosting parallelism in index scanning; given Ann TopN's high filter rate (only N rows returned), the back-to-table phase can remain serial without hurting overall performance. On SIFT 1M, enabling
+`set optimize_index_scan_parallelism=true;`
+cuts single-threaded TopN query latency from 230 ms to 50 ms.
+Additionally, 4.0 introduces dynamic parallelism: before each scheduling
round, Doris adjusts the number of submitted Scan tasks based on thread-pool
pressure—reducing tasks under high load, increasing when idle—to balance
resource use and scheduling overhead across serial and concurrent workloads.
+
+### Global TopN Delayed Materialization
+A typical Ann TopN query executes in two stages:
+1. Scan obtains per-segment TopN distances via the index;
+2. Global sort merges per-segment TopN to produce the final TopN.
+
+If the projection returns many columns or large types (e.g., String), stage-1
reading N rows from each segment can incur heavy IO—and many rows are discarded
during stage-2 global sort. Doris minimizes stage-1 IO via global TopN delayed
materialization.
+For `SELECT id, l2_distance_approximate(embedding, [...]) AS dist FROM tbl
ORDER BY dist LIMIT 100;`: stage-1 outputs only 100 `dist` values and rowids
per segment via Ann Index Only Scan + virtual columns. With M segments, stage-2
globally sorts `100 * M` `dist` values to obtain the final TopN and rowids,
then the Materialize operator fetches the needed columns by rowid from
corresponding tablet/rowset/segment.
\ No newline at end of file
diff --git a/docs/ai/vector-search/overview.md
b/docs/ai/vector-search/overview.md
index 5ecfbcf5fb5..2f3fb6b817c 100644
--- a/docs/ai/vector-search/overview.md
+++ b/docs/ai/vector-search/overview.md
@@ -100,38 +100,7 @@ SELECT count(*) FROM sift_1M
| 1000000 |
+----------+
```
-
-The SIFT dataset ships with a ground-truth set for result validation. Pick one
query vector and first run an exact Top-N using the precise distance:
-
-```sql
-SELECT id,
- L2_distance(
- embedding,
-
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
- ) AS distance
-FROM sift_1m
-ORDER BY distance
-LIMIT 10;
---------------
-
-+--------+----------+
-| id | distance |
-+--------+----------+
-| 178811 | 210.1595 |
-| 177646 | 217.0161 |
-| 181997 | 218.5406 |
-| 181605 | 219.2989 |
-| 821938 | 221.7228 |
-| 807785 | 226.7135 |
-| 716433 | 227.3148 |
-| 358802 | 230.7314 |
-| 803100 | 230.9112 |
-| 866737 | 231.6441 |
-+--------+----------+
-10 rows in set (0.29 sec)
-```
-
-When using `l2_distance` or `inner_product`, Doris computes the distance
between the query vector and all 1,000,000 candidate vectors, then applies a
TopN operator globally. Using `l2_distance_approximate` /
`inner_product_approximate` triggers the index path:
+Using `l2_distance_approximate` / `inner_product_approximate` triggers the ANN
index path. The function must match the index `metric_type` exactly (e.g.,
`metric_type=l2_distance` → use `l2_distance_approximate`;
`metric_type=inner_product` → use `inner_product_approximate`). For ordering:
L2 uses ascending distance (smaller is closer); inner product uses descending
score (larger is closer).
```sql
SELECT id,
@@ -161,11 +130,18 @@ LIMIT 10;
10 rows in set (0.02 sec)
```
-With the ANN index, query latency in this example drops from about 290 ms to
20 ms.
+To compare with exact ground truth, use `l2_distance` or `inner_product`
(without the `_approximate` suffix). In this example, exact search takes ~290
ms:
+```
+10 rows in set (0.29 sec)
+```
+
+With the ANN index, query latency drops from ~290 ms to ~20 ms in this example.
-ANN indexes are built at the segment granularity. Because tables are
distributed, after each segment returns its local TopN, the TopN operator
merges results across tablets and segments to produce the global TopN.
+ANN indexes are built at segment granularity. In distributed tables, each
segment returns its local TopN; then the TopN operator merges results across
tablets and segments to produce the global TopN.
-Note: When `metric_type = l2_distance`, a smaller distance means closer
vectors. For `inner_product`, a larger value means closer vectors. Therefore,
if using `inner_product`, you must use `ORDER BY dist DESC` to obtain TopN via
the index.
+Note on ordering:
+- For `metric_type = l2_distance`, smaller distance = closer vectors → use
`ORDER BY dist ASC`.
+- For `metric_type = inner_product`, larger value = closer vectors → use
`ORDER BY dist DESC` to obtain TopN via the index.
## Approximate Range Search
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/behind-index.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/behind-index.md
new file mode 100644
index 00000000000..91e01f4c823
--- /dev/null
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/behind-index.md
@@ -0,0 +1,251 @@
+---
+{
+ "title": "Optimizations Behind Performance",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Early versions of Apache Doris were systems for online analytical processing, mainly handling reporting and aggregation analysis; the most typical queries are multi-table JOINs and GROUP BY aggregations. Version 2.x implemented text retrieval based on inverted indexes and introduced the Variant data type for efficient JSON handling. Version 3.x introduced storage-compute separation, letting Apache Doris use object storage to greatly reduce storage costs, while 4.x introduces vector indexes and formally steps into the AI era: with the hybrid search capability of vector search plus text search, Doris will become the core AI data-analysis platform for enterprises. Here we describe how Doris implements vector indexes in 4.x and the work done to bring its performance up to the industry-leading level.
+
+We split the vector-index implementation into two parts. The first is the indexing stage, which must solve: 1. data sharding; 2. efficient construction of high-quality indexes; 3. index management. The second is the querying stage, whose single core goal is improving query performance; this raises many questions, such as how to eliminate redundant computation and extra disk IO as far as possible, and how to optimize concurrency.
+
+## Indexing Stage
+Indexing performance is strongly correlated with the index hyperparameters: higher index quality inevitably means longer build time. Thanks to Apache Doris's optimizations on the data-ingestion path, Doris can keep index quality high while improving ingestion performance.
+
+Tested at the 768-dimension, 10M-row scale, Apache Doris's ingestion performance is among the best in the industry.
+
+
+
+### Multi-Level Sharding
+Internal tables in Apache Doris are inherently distributed. Users perceive only a single logical table during query and ingestion, while the Doris kernel automatically creates the required number of physical tablets according to the table definition and, during ingestion, routes data to the tablet on the corresponding BE by partition and bucket keys. Multiple tablets together form the table the user sees. Each ingestion forms a transaction and produces a rowset (the logical unit for version control) on the corresponding tablet. Each rowset contains several segments; segments are what actually carry the data, and the ANN index also works at segment granularity.
+
+
+
+Vector indexes (such as HNSW) depend on several key hyperparameters that directly determine index quality and query performance, and usually only reach the desired effect at a fixed data scale. **Apache Doris's multi-level sharding decouples the index parameters from the total table size: users need not rebuild indexes as the total data volume grows, and only need to care about the per-batch ingestion size and the corresponding parameter settings.** Based on our tests, the empirical HNSW parameters under different batch sizes are as follows:
+
+| batch_size | max_degree | ef_construction | ef_search | recall@100 |
+|------------|------------|-----------------|-----------|------------|
+| 250000 | 100 | 200 | 50 | 89% |
+| 250000 | 100 | 200 | 100 | 93% |
+| 250000 | 100 | 200 | 150 | 95% |
+| 250000 | 100 | 200 | 200 | 98% |
+| 500000 | 120 | 240 | 50 | 91% |
+| 500000 | 120 | 240 | 100 | 94% |
+| 500000 | 120 | 240 | 150 | 96% |
+| 500000 | 120 | 240 | 200 | 99% |
+| 1000000 | 150 | 300 | 50 | 90% |
+| 1000000 | 150 | 300 | 100 | 93% |
+| 1000000 | 150 | 300 | 150 | 96% |
+| 1000000 | 150 | 300 | 200 | 98% |
+
+In other words, users only need to focus on the per-batch ingestion volume and choose suitable index parameters accordingly to get stable query behavior while maintaining index quality.
+
+### High-Performance Index Building
+
+#### Parallel, High-Quality Index Construction
+
+Apache Doris uses two-level parallelism to accelerate index builds: cluster-level parallelism across multiple BE nodes, and, within each BE, multithreaded parallel distance computation over grouped batches of data to speed up building the index data structure. While being fast, Doris also improves index quality through in-memory batching: when the total vector count is fixed but batches are too fine and the index is appended to frequently, the graph structure tends to become sparse and recall drops. For example, for 768D 10M vectors, building in 10 batches can reach about 99% recall, while 100 batches may drop to about 95%. With in-memory batching, memory footprint and graph quality are better balanced under the same hyperparameters, avoiding quality degradation caused by over-fine batching.
+
+#### SIMD
+
+The core cost of ANN index building is large-scale distance computation, a typical CPU-bound task. Apache Doris centralizes this computation on BE nodes, implements it in C++, and makes full use of Faiss's automatic and manual vectorization optimizations. Taking L2 distance as an example, Faiss triggers auto-vectorization through compiler-hint macros:
+```cpp
+FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
+float fvec_L2sqr(const float* x, const float* y, size_t d) {
+ size_t i;
+ float res = 0;
+ FAISS_PRAGMA_IMPRECISE_LOOP
+ for (i = 0; i < d; i++) {
+ const float tmp = x[i] - y[i];
+ res += tmp * tmp;
+ }
+ return res;
+}
+FAISS_PRAGMA_IMPRECISE_FUNCTION_END
+```
+The `FAISS_PRAGMA_IMPRECISE_*` macros above guide the compiler to auto-vectorize:
+```cpp
+#define FAISS_PRAGMA_IMPRECISE_LOOP \
+ _Pragma("clang loop vectorize(enable) interleave(enable)")
+```
+Meanwhile, Faiss applies explicit vectorization with `_mm*`/`_mm256*`/`_mm512*` instructions inside `#ifdef SSE3/AVX2/AVX512F` blocks; combined with the templates `ElementOpL2/ElementOpIP` (element-wise operations for L2 and inner product) and the dimension-specialized `fvec_op_ny_D{1,2,4,8,12}`, this:
+- Processes multiple samples per iteration (e.g., 8/16) and uses in-register matrix transposes (such as transpose_8x2/16x4/...) to improve access contiguity;
+- Uses FMA instructions (e.g., `_mm512_fmadd_ps`) to fuse multiply-adds and reduce instruction count;
+- Uses horizontal sums to obtain scalar results quickly;
+- Handles tail elements not aligned to 4/8/16 via masked branches.
+These optimizations effectively compress the instruction and memory-access overhead of distance computation and significantly improve index-build throughput.
+
+## Querying Stage
+
+Search workloads are extremely latency sensitive. At the scale of tens of millions of rows with highly concurrent queries, P99 latency usually needs to stay under 500 ms, which raises the bar for Doris's optimizer, execution engine, and index implementation. Out-of-the-box tests show that Apache Doris's query performance has reached the level of mainstream dedicated vector databases. The chart below compares Apache Doris with other databases that offer vector search on the Performance768D10M dataset; data for the other databases comes from Zilliz's open-source [VectorDBBench](https://github.com/zilliztech/VectorDBBench) framework.
+
+
+
+> Note: the chart only includes out-of-the-box results for some databases. OpenSearch and Elastic Cloud can further improve query performance by optimizing the number of index files.
+
+### Prepare Statement
+In the traditional execution path, Doris runs the full optimization pipeline (parsing, semantic analysis, RBO, CBO) for every SQL statement. This is indispensable for general OLAP scenarios, but produces visible extra overhead for simple, highly repetitive query patterns such as search. Doris 4.0 therefore extends Prepare Statement beyond point lookups to all SQL types, including vector retrieval. The core ideas:
+1. Separate compilation from execution
+   - The Prepare phase performs parsing, semantic analysis, and optimization once, producing a reusable Logical Plan.
+   - The Execute phase only binds the actual parameters and directly runs the generated plan, skipping the optimizer entirely.
+2. Plan cache
+   - Whether a plan can be reused is decided by the SQL fingerprint (normalized SQL + schema version).
+   - Queries with different parameter values but the same structure reuse the plan directly, avoiding repeated optimization.
+3. Schema version check
+   - The table's schema version is validated at execution time to guarantee plan correctness.
+   - Schema unchanged → reuse directly; changed → automatically invalidate and re-Prepare.
+4. Significant speedup from skipping the optimizer
+   - Execute no longer runs RBO/CBO, so optimizer time is almost entirely eliminated.
+   - For template-style queries such as vector retrieval, Prepare significantly lowers end-to-end latency.
+
+### Index Only Scan
+Apache Doris implements vector indexes as external (pluggable) indexes. External indexes are easy to manage and support asynchronous builds, but they bring performance challenges: how to avoid redundant computation and extra IO. Besides the matching row IDs, an ANN index can also return the distances between vectors. To use this extra information efficiently, the execution engine short-circuits distance-related expressions early, at the Scan operator. Doris performs this short-circuit automatically through the "virtual column" mechanism, and Ann Index Only Scan fully eliminates the read IO related to distance computation.
+In the naive flow, Scan pushes predicates down to the index and the index returns row IDs; Scan then reads the data pages by row ID, evaluates the expressions, and returns N rows upstream.
+
+
+
+After applying Index Only Scan, the flow becomes:
+
+
+
+For example, `SELECT l2_distance_approximate(embedding, [...]) AS dist FROM tbl ORDER BY dist LIMIT 100;` no longer triggers any data-file IO during execution.
+
+Besides Ann TopN Search, index-accelerated Range Search and Compound Search adopt similar optimizations. Range Search is more complex than TopN: the comparison operator determines whether the index can return dist. The following sorts out the query types related to Ann Index Only Scan and whether each can be optimized by Index Scan:
+
+```SQL
+-- Sql1
+-- Range + proj
+-- The ANN index can return dist, so dist need not be recomputed
+-- The virtual-column-for-CSE optimization also avoids recomputing dist in proj
+-- IndexScan: True
+select id, dist(embedding, [...]) from tbl where dist <= 10;
+
+-- Sql2
+-- Range + no-proj
+-- The ANN index can return dist, so dist need not be recomputed
+-- IndexScan: True
+select id from tbl where dist <= 10 order by id limit N;
+
+-- Sql3
+-- Range + proj + no-dist-from index
+-- The ANN index cannot return dist (it can only update the rowid map)
+-- Since proj must return dist, embedding has to be reread
+-- IndexScan: False
+select id, dist(embedding, [...]) from tbl where dist > 10;
+
+-- Sql4
+-- Range + proj + no-dist-from index
+-- The ANN index cannot return dist (it can only update the rowid map)
+-- But proj does not need dist, so embedding need not be reread
+-- IndexScan: True
+select id from tbl where dist > 10;
+
+-- Sql5
+-- TopN
+-- The ANN index returns dist; the virtual slot for CSE ensures the index's dist is passed up to proj
+-- So the embedding column need not be read
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl order by dist(embedding, [...]) asc limit N;
+
+-- Sql6
+-- TopN + IndexFilter
+-- 1. The comment column need not be read; the inverted index scan already performs this optimization
+-- 2. The embedding column need not be read, for the same reason as Sql5
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap' ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql7
+-- TopN + Range
+-- IndexScan: True, as the combination of Sql1 and Sql5
+select id[, dist(embedding, [...])] from tbl where dist(embedding, [...]) > 10 order by dist(embedding, [...]) limit N;
+
+-- Sql8
+-- TopN + Range + IndexFilter
+-- IndexScan: True, as the combination of Sql7 and Sql6
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql9
+-- TopN + Range + CommonFilter
+-- Key points: 1. dist < 10 rather than dist > 10; 2. the common filter reads dist, not embedding directly
+-- The ANN index can return dist; the virtual slot ref for CSE ensures all reads of dist refer to the same column
+-- Although Ann TopN cannot apply here, in theory the embedding column never needs to materialize
+-- In practice embedding is still materialized: whether a column can skip reading is decided by whether residual predicates remain on it, and the common filter cannot be eliminated, so the current code materializes it
+-- The ROI of this optimization is low, so it is not implemented
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) < 10 AND abs(dist(embedding) + 10) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql10
+-- Variant of Sql9: dist < 10 becomes dist > 10, so the index cannot return dist
+-- Computing abs(dist(embedding) + 10) therefore requires embedding to materialize
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) > 10 AND abs(dist(embedding) + 10) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql11
+-- Variant of Sql9: abs(dist(embedding) + 10) > 10 becomes array_size(embedding) > 10; array_size forces embedding to materialize
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) < 10 AND array_size(embedding) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+```
+
+### Virtual Columns for Common Subexpression Elimination
+
+Index Only Scan mainly solves the IO problem, avoiding massive random reads of embedding. To further eliminate redundant computation, Doris introduces a "virtual column" mechanism in the compute layer that passes the index-returned dist to the expression executor as a column.
+Key design points of virtual columns:
+1. The expression node `VirtualSlotRef`;
+2. The column iterator `VirtualColumnIterator`.
+
+`VirtualSlotRef` represents a special column generated at compute time: it is materialized by one expression, can be shared by multiple expressions, and is computed only on first use, eliminating redundant evaluation of common subexpressions (CSE) across Projection and predicates. `VirtualColumnIterator` materializes the index-returned distances into expressions, avoiding repeated distance-function computation. The mechanism was first used for CSE elimination in ANN-related queries and was later extended to the general Projection + Scan + Filter combination. On the ClickBench dataset, the query below counts the 20 websites with the most clicks from Google:
+```sql
+set experimental_enable_virtual_slot_for_cse=true;
+
+SELECT counterid,
+ COUNT(*) AS hit_count,
+ COUNT(DISTINCT userid) AS unique_users
+FROM hits
+WHERE ( UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) = 'GOOGLE.COM'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) =
'GOOGLE.RU'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) LIKE
'%GOOGLE%' )
+ AND ( LENGTH(regexp_extract(referer, '^https?://([^/]+)', 1)) > 3
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) != ''
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) IS NOT NULL )
+ AND eventdate = '2013-07-15'
+GROUP BY counterid
+HAVING hit_count > 100
+ORDER BY hit_count DESC
+LIMIT 20;
+```
+The core expression `regexp_extract(referer, '^https?://([^/]+)', 1)` is CPU-intensive and reused in several places. With the virtual-column optimization enabled (`set experimental_enable_virtual_slot_for_cse=true;`):
+- Enabled: 0.57 s
+- Disabled: 1.50 s
+
+End-to-end performance improves by roughly 2.6x.
+
+### Scan Parallelism Optimization
+Doris reworked the Scan parallelism policy for Ann TopN Search. The original policy decided parallelism by row count (by default, 2,097,152 rows per Scan Task). Because segments are created by size, a high-dimensional vector column keeps the row count of a single segment far below that threshold, so one Scan Task ends up scanning multiple segments serially, hurting performance. Doris now creates one Scan Task strictly per segment, raising parallelism in the index-scan phase; since the filter rate of Ann TopN Search is extremely high (only N rows are returned), the back-to-table phase does not affect overall performance even when serial. On SIFT 1M, after enabling `set optimize_index_scan_parallelism=true;`, serial TopN query time drops from 230 ms to 50 ms.
+In addition, 4.0 introduces dynamic parallelism adjustment: before each scheduling round, the number of submittable tasks is decided from the pressure on the Scan thread pool; under high pressure parallelism is reduced, and when resources are idle it is increased, balancing resource utilization and scheduling overhead across serial and highly concurrent scenarios.
+### Global TopN Delayed Materialization
+A typical Ann TopN query has two stages:
+1. The Scan operator obtains each segment's TopN distances through the index;
+2. A global sort node merge-sorts the per-segment TopN results to get the final TopN.
+
+If the projection returns many columns or contains large columns (such as String), the N rows read from each segment in stage one may cause heavy disk IO, and most of them are discarded in the stage-two global sort (not part of the final TopN). Doris minimizes the stage-one read volume through global TopN delayed materialization.
+Take `SELECT id, l2_distance_approximate(embedding, [...]) AS dist FROM tbl ORDER BY dist LIMIT 100;` as an example: in stage one, each segment outputs only 100 `dist` values and their `rowid`s via Ann Index Only Scan + virtual columns; with M segments in total, stage two globally sorts the `100 * M` `dist` values to get the final TopN and their `rowid`s, and finally the Materialize operator materializes the needed columns by those `rowid`s from the corresponding tablet/rowset/segment.
\ No newline at end of file
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index b06d595a502..a7c8b6a5fe1 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -93,11 +93,11 @@ select count(*) from sift_1M
| 1000000 |
+----------+
```
-SIFT 数据集同时发布了一组 ground truth,用于校验结果。下面选取一组向量,先使用精确距离函数进行 TopN 召回:
+使用 `l2_distance_approximate` / `inner_product_approximate` 会触发 ANN
索引路径。函数名必须与索引的 `metric_type` 完全匹配(例如:`metric_type=l2_distance` → 使用
`l2_distance_approximate`;`metric_type=inner_product` → 使用
`inner_product_approximate`)。排序规则:L2 距离使用升序(越小越近);Inner Product 使用降序(越大越近)。
```sql
SELECT id,
- L2_distance(
+ l2_distance_approximate(
embedding,
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
) AS distance
@@ -120,41 +120,21 @@ LIMIT 10;
| 803100 | 230.9112 |
| 866737 | 231.6441 |
+--------+----------+
+10 rows in set (0.02 sec)
+```
+要与精确的真实结果进行比较,请使用 `l2_distance` 或 `inner_product`(不带 `_approximate`
后缀)。在此示例中,精确搜索耗时约 290 毫秒:
+```
10 rows in set (0.29 sec)
```
-当使用 `l2_distance` 或 `inner_product` 时,Doris 需要计算查询向量与 1,000,000 个候选向量之间的距离,再通过
TopN 算子得到全局结果。使用 `l2_distance_approximate` / `inner_product_approximate`
可触发索引执行路径:
-```sql
-SELECT id,
- l2_distance_approximate(
- embedding,
-
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
- ) AS distance
-FROM sift_1m
-ORDER BY distance
-LIMIT 10;
---------------
+使用 ANN 索引后,查询延迟从约 290 毫秒降至约 20 毫秒。
-+--------+----------+
-| id | distance |
-+--------+----------+
-| 178811 | 210.1595 |
-| 177646 | 217.0161 |
-| 181997 | 218.5406 |
-| 181605 | 219.2989 |
-| 821938 | 221.7228 |
-| 807785 | 226.7135 |
-| 716433 | 227.3148 |
-| 358802 | 230.7314 |
-| 803100 | 230.9112 |
-| 866737 | 231.6441 |
-+--------+----------+
-10 rows in set (0.02 sec)
-```
-可以看到使用 ANN 索引后,查询耗时从约 290 ms 降至约 20 ms。
-Doris 中,ANN 索引建立在 segment 粒度;由于表是分布式的,各 segment 返回局部 TopN 后,TopN 算子会将多个 tablet
的结果归并生成全局 TopN。
+ANN 索引以 segment 为粒度构建。在分布式表中,每个 segment 返回其本地 TopN 结果;然后 TopN 算子在 tablet 和
segment 之间合并结果以产生全局 TopN。
+
+排序说明:
+- 对于 `metric_type = l2_distance`,距离越小表示向量越接近 → 使用 `ORDER BY dist ASC`。
+- 对于 `metric_type = inner_product`,数值越大表示向量越接近 → 使用 `ORDER BY dist DESC`
通过索引获取 TopN。
-需要注意:当 `l2_distance` 作为索引 metric 时,distance 越小表示越接近;`inner_product`
则相反,值越大越接近。因此若使用 `inner_product`,必须 `ORDER BY dist DESC` 才能通过索引获得 TopN。
## 近似范围搜索
除了常见的 TopN 最近邻搜索(即返回与目标向量最近的前 N 条记录)之外,向量检索中还有一类常见的查询方式是 基于距离阈值的范围搜索。
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/behind-index.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/behind-index.md
new file mode 100644
index 00000000000..cb2d2defdca
--- /dev/null
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/behind-index.md
@@ -0,0 +1,251 @@
+---
+{
+ "title": "性能测试背后的优化",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+早期版本的 Apache Doris 是一个面向在线数据分析处理的系统,主要处理的场景是报表分析和聚合数据分析,最典型的查询是多表 JOIN 以及
GROUP BY 聚合查询。在 2.X 版本中实现了基于倒排索引的文本检索功能,引入了 Variant 数据类型来高效处理 JSON。在 3.X
版本中引入了存算分离特性使得 Apache Doris 可以利用对象存储极大降低存储成本,而 4.X 版本则是通过引入向量索引使得 Apache Doris
正式迈入 AI 时代,利用向量搜索与文本搜索提供的混合搜索能力,Doris 将会成为企业的 AI 数据分析核心平台。这里我们会介绍 Doris 在 4.X
版本中是如何实现向量索引的,以及为了使其性能追上并达到业界先进水平,Doris 做了哪些工作。
+
+我们把向量索引的实现分为两个大部分,第一个部分是索引阶段,索引阶段需要解决的问题是:1. 数据分片;2. 高效构建高质量索引;3.
索引管理。第二个部分则是查询阶段,查询阶段只有一个核心目标,如何提升查询性能,这其中我们会面临很多问题,比如如何最大程度消除重复计算与多余的磁盘IO,如何优化并发性能等等。
+
+## 索引阶段
+索引阶段的性能和索引的超参数强相关,如果需要一个更高的索引质量,那么势必会导致索引时间变长,得益于 Apache Doris
在数据导入路径上的优化,Doris 可以在保持高质量索引的同时提高导入的性能。
+
+在 768 维 10M 行的数据规模上进行测试,Apache Doris 的导入性能处于业界先进水平
+
+
+
+### 多层级分片
+Apache Doris 的内表天然是分布式表。用户在查询或导入时仅感知到一张逻辑表(Table),而 Doris
内核会依据表定义自动创建满足数量要求的物理表(Tablet),并在导入过程中按分区键与分桶键将数据路由到对应 BE 的 tablet。多个 tablet
共同组成用户看到的 table。每次导入都会形成一个导入事务,并在对应的 tablet 上生成一个 rowset(用于版本控制的逻辑单位)。每个 rowset
下包含若干个 segment,真正承载数据的是 segment,ANN 索引也作用于 segment 粒度。
+
+
+
+向量索引(如 HNSW)依赖多个关键超参数,这些参数直接决定索引质量与查询性能,并通常在固定数据规模下才能达到理想效果。**Apache Doris
的多层级分片将“索引参数”与“整表数据规模”解耦:用户无需因数据总量增长而重建索引,只需关注每批次的导入规模与相应参数设置。** 基于我们的测试,HNSW
索引在不同批次规模下的经验参数如下:
+
+| batch_size | max_degree | ef_construction | ef_search | recall@100 |
+|------------|------------|-----------------|-----------|------------|
+| 250000 | 100 | 200 | 50 | 89% |
+| 250000 | 100 | 200 | 100 | 93% |
+| 250000 | 100 | 200 | 150 | 95% |
+| 250000 | 100 | 200 | 200 | 98% |
+| 500000 | 120 | 240 | 50 | 91% |
+| 500000 | 120 | 240 | 100 | 94% |
+| 500000 | 120 | 240 | 150 | 96% |
+| 500000 | 120 | 240 | 200 | 99% |
+| 1000000 | 150 | 300 | 50 | 90% |
+| 1000000 | 150 | 300 | 100 | 93% |
+| 1000000 | 150 | 300 | 150 | 96% |
+| 1000000 | 150 | 300 | 200 | 98% |
+
+换言之,用户只需聚焦“每一批次的导入数据量”,并据此选择合适的索引参数,即可在保证索引质量的同时获得稳定的查询表现。
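
按批次规模选择经验参数的过程可以写成一个简单的查表函数。下面是一个假设性的示意(`HnswParams`/`pick_hnsw_params` 均为示意名称,并非 Doris 接口;参数值取自上表):

```cpp
#include <cstddef>

// 示意:依据上表,按单批导入行数选择 HNSW 构建参数
struct HnswParams {
    int max_degree;
    int ef_construction;
};

inline HnswParams pick_hnsw_params(std::size_t batch_rows) {
    if (batch_rows <= 250000) return {100, 200};
    if (batch_rows <= 500000) return {120, 240};
    return {150, 300};  // 约 1M 行及以上
}
```

查询侧的 `ef_search` 则按目标召回率单独选取,例如上表中 `ef_search = 200` 时 recall@100 可达约 98%–99%。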
+
+### 高性能索引构建
+
+#### 并行高质量索引构建
+
+Apache Doris 采用“双层并行”加速索引构建:一方面通过多台 BE 节点实现集群级并行;另一方面在每台 BE
内,对同一批数据分组进行多线程并行的距离计算,以提升索引数据结构的构建速度。在“快”的同时,Doris
通过内存攒批提升索引质量:当总向量数固定但分批过细、频繁追加索引时,图结构容易稀疏、召回率下降。例如对 768D10M 的向量,分 10 次构建索引可达约 99% 召回,若改为分 100 次则可能降至约 95%。通过内存攒批,在相同超参数下可更好地平衡内存占用与图质量,避免因过度分批导致的质量劣化。
+
+#### SIMD
+
+ANN 索引构建的核心成本在大规模距离计算,属于典型 CPU 密集型任务。Apache Doris 将这部分计算集中在 BE 节点,相关实现均以 C++
编写,并充分利用 Faiss 的自动与手动向量化优化。以 L2 距离为例,Faiss 通过编译器辅助宏触发自动向量化,代码示例如下:
+```cpp
+FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
+float fvec_L2sqr(const float* x, const float* y, size_t d) {
+ size_t i;
+ float res = 0;
+ FAISS_PRAGMA_IMPRECISE_LOOP
+ for (i = 0; i < d; i++) {
+ const float tmp = x[i] - y[i];
+ res += tmp * tmp;
+ }
+ return res;
+}
+FAISS_PRAGMA_IMPRECISE_FUNCTION_END
+```
+上述 `FAISS_PRAGMA_IMPRECISE_*` 宏可引导编译器进行自动向量化:
+```cpp
+#define FAISS_PRAGMA_IMPRECISE_LOOP \
+ _Pragma("clang loop vectorize(enable) interleave(enable)")
+```
+同时,Faiss 在 `#ifdef SSE3/AVX2/AVX512F` 条件编译块中使用 `_mm*`/`_mm256*`/`_mm512*`
指令进行显式向量化;结合模板 `ElementOpL2/ElementOpIP` 与维度特化 `fvec_op_ny_D{1,2,4,8,12}`,实现:
+- 批量处理多条样本(如 8/16),并通过寄存器内矩阵转置提升访问连续性;
+- 使用 FMA 指令(如 `_mm512_fmadd_ps`)合并乘加以减少指令数;
+- 通过水平求和(horizontal sum)快速得到标量结果;
+- 以 masked 分支处理非 4/8/16 对齐的尾元素。
+这些优化有效压缩距离计算的指令与访存开销,显著提升索引构建吞吐。
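
上述“批量乘加、水平求和、尾部处理”的结构可以用可移植 C++ 做一个示意(这只是展示思路的草图,并非 Faiss 源码;真实实现使用 `_mm256*`/`_mm512*` intrinsics):

```cpp
#include <cstddef>

// 示意:用 4 路独立累加器展开 L2 距离计算,
// 模拟 SIMD 的“批量乘加 + 水平求和 + 尾部处理”结构。
inline float l2sqr_unrolled(const float* x, const float* y, std::size_t d) {
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= d; i += 4) {  // 主循环:一次处理 4 个分量
        float t0 = x[i] - y[i];
        float t1 = x[i + 1] - y[i + 1];
        float t2 = x[i + 2] - y[i + 2];
        float t3 = x[i + 3] - y[i + 3];
        acc0 += t0 * t0;
        acc1 += t1 * t1;
        acc2 += t2 * t2;
        acc3 += t3 * t3;
    }
    float res = (acc0 + acc1) + (acc2 + acc3);  // 相当于“水平求和”
    for (; i < d; ++i) {  // 尾部:处理非 4 对齐的剩余分量
        float t = x[i] - y[i];
        res += t * t;
    }
    return res;
}
```

多路独立累加器消除了循环内的串行依赖,便于编译器映射到 SIMD 寄存器与 FMA 指令。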
+
+## 查询阶段
+
+搜索场景对延迟极为敏感。在千万级数据量与高并发查询的场景下,通常需要将 P99 延迟控制在 500 ms 以内。这对 Doris
的优化器、执行引擎以及索引实现都提出了更高要求。开箱即用的测试表明,Apache Doris 的查询性能已达到业界主流专用向量数据库的水平。下图展示了
Apache Doris 与其他具备向量搜索能力的数据库在 Performance768D10M 数据集上的对比;其他数据库数据来自 Zilliz 开源的
[VectorDBBench](https://github.com/zilliztech/VectorDBBench) 框架。
+
+
+
+> 注:图中仅包含部分数据库的开箱测试结果。OpenSearch 与 Elastic Cloud 可通过优化索引文件数量进一步提升查询性能。
+
+### Prepare Statement
+在传统执行路径中,Doris 会对每条 SQL 执行完整优化流程(语法解析、语义分析、RBO、CBO)。这在通用 OLAP
场景必不可少,但在搜索等简单且高度重复的查询模式中会产生明显的额外开销。为此,Doris 4.0 扩展了 Prepare
Statement,使其不仅支持点查,也适用于包含向量检索在内的所有 SQL 类型。核心思路如下:
+1. 分离编译与执行
+ - Prepare 阶段一次性完成解析、语义与优化,生成可复用的逻辑计划(Logical Plan)。
+ - Execute 阶段仅绑定实参并直接执行已生成的计划,完全跳过优化器。
+2. 计划缓存(Plan Cache)
+ - 按 SQL 指纹(normalized SQL + schema version)判断计划是否可复用。
+ - 参数值不同但结构一致时仍可直接复用,避免重复优化。
+3. Schema Version 校验
+ - 执行时校验表结构版本,确保计划正确性。
+ - schema 未变化 → 直接复用;已变化 → 自动失效并重新 Prepare。
+4. 跳过优化器带来显著加速
+ - Execute 不再运行 RBO/CBO,优化器耗时几乎被完全消除。
+ - 在向量检索这类模板化查询中,Prepare 可显著降低端到端延迟。
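
Prepare/Execute 与计划缓存、schema 版本校验的交互可以用如下示意代码概括(`PlanCache`/`CachedPlan` 均为示意名称,并非 Doris 源码):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// 示意:按 “normalized SQL + schema version” 缓存逻辑计划
struct CachedPlan {
    std::string logical_plan;  // 真实系统中是计划树,这里用字符串代替
    int64_t schema_version;
};

class PlanCache {
public:
    // Prepare:首次完整优化并缓存;结构相同且 schema 未变时直接复用
    const CachedPlan& prepare(const std::string& normalized_sql,
                              int64_t schema_version) {
        auto it = cache_.find(normalized_sql);
        if (it != cache_.end() && it->second.schema_version == schema_version) {
            return it->second;  // 命中:完全跳过优化器
        }
        ++optimize_count_;  // 未命中或 schema 变化:重新走优化流程
        cache_[normalized_sql] = {"plan(" + normalized_sql + ")", schema_version};
        return cache_[normalized_sql];
    }
    int optimize_count() const { return optimize_count_; }

private:
    std::unordered_map<std::string, CachedPlan> cache_;
    int optimize_count_ = 0;
};
```

参数值不同但 SQL 结构一致的查询共享同一缓存项;表结构变更会使版本号不匹配,触发自动失效与重新 Prepare。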
+
+### Index Only Scan
+Apache Doris 的向量索引采用外挂方式。外挂索引便于管理与异步构建,但也带来性能挑战:如何避免重复计算与多余 IO。ANN
索引除返回命中行号外,还可返回向量间距离。为高效利用这些额外信息,执行引擎在 Scan 算子阶段对距离相关表达式进行“提前短路”。Doris
通过“虚拟列”机制自动完成该短路,并以 Ann Index Only Scan 完全消除与距离计算相关的读 IO。
+在朴素流程中,Scan 将谓词下推至索引,索引返回行号;随后 Scan 按行号读取数据页(Data Page),再计算表达式并向上返回 N 行结果。
+
+
+
+应用 Index Only Scan 后,流程变为:
+
+
+
+例如 `SELECT l2_distance_approximate(embedding, [...]) AS dist FROM tbl ORDER BY
dist LIMIT 100;`,执行过程将不再触发数据文件 IO。
+
+除 Ann TopN Search 外,支持索引加速的 Range Search 与复合检索(Compound Search)也采用类似优化。Range
Search 较 TopN 更复杂:不同比较方式决定索引是否能返回 dist。以下梳理与 Ann Index Only Scan 相关的查询类型及其是否可被
Index Scan 优化:
+
+```SQL
+-- Sql1
+-- Range + proj
+-- Ann 索引可以返回 dist,所以 dist 不需要再次计算
+-- 同时 virtual column for cse 的优化避免了 proj 里面的 dist 计算
+-- IndexScan: True
+select id, dist(embedding, [...]) from tbl where dist <= 10;
+
+-- Sql2
+-- Range + no-proj
+-- Ann 索引可以返回 dist,所以 dist 不需要再次计算
+-- IndexScan: True
+select id from tbl where dist <= 10 order by id limit N;
+
+-- Sql3
+-- Range + proj + no-dist-from index
+-- Ann 索引无法返回 dist(索引只能更新 rowid map)
+-- 由于 proj 里面要求返回 dist 因此 embedding 需要重读
+-- IndexScan: False
+select id, dist(embedding, [...]) from tbl where dist > 10;
+
+-- Sql4
+-- Range + proj + no-dist-from index
+-- Ann 索引无法返回 dist(索引只能更新 rowid map)
+-- 但是 proj 里面不需要 dist,因此 embedding 不需要重新读
+-- IndexScan: True
+select id from tbl where dist > 10;
+
+-- Sql5
+-- TopN
+-- AnnIndex 返回 dist,virtual slot for cse 确保了索引的 dist 被上传到 proj
+-- 因此不需要读 embedding 列
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl order by dist(embedding, [...])
asc limit N;
+
+-- Sql6
+-- TopN + IndexFilter
+-- 1. comment 列不需要读,inverted index scan 已经做了这个优化
+-- 2. embedding 列不需要读,原因与 sql5 一样
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql7
+-- TopN + Range
+-- IndexScan: True,原因是 Sql1 与 Sql5 组合
+select id[, dist(embedding, [...])] from tbl where dist(embedding, [...]) > 10
order by dist(embedding, [...]) limit N;
+
+-- Sql8
+-- TopN + Range + IndexFilter
+-- IndexScan: True,原因是 Sql7 与 Sql6 组合
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql9
+-- TopN + Range + CommonFilter
+-- 这里重点: 1. dist < 10 而不是 dist > 10; 2. common filter 没有直接读 embedding,而是读的 dist
+-- Ann index 可以返回 dist,virtual slot ref for cse 确保了所有对 dist 的读都是同一个列
+-- 此时虽然 ann topn 无法 apply,理论上 embedding 列依然全程不需要物化
+-- 但实际执行中仍会物化 embedding:当前实现通过“列上是否还有残留谓词”判断能否跳过读取,而 common filter 本身无法被消除,因此该列仍被判定为需要物化
+-- 该优化点 ROI 不高,暂未实现
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) < 10 AND abs(dist(embedding) + 10) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql10
+-- Sql9 的变种,dist < 10 变成了 dist > 10,此时 index 无法返回 dist
+-- 因此为了计算 abs(dist(embedding) + 10),需要物化 embedding
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) > 10 AND abs(dist(embedding) + 10) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql11
+-- Sql9 的变种,abs(dist(embedding) + 10) > 10 变成了 array_size(embedding) > 10,区别在于
array_size 强制要求 embedding 的物化
+-- 为了计算 array_size(embedding),需要物化 embedding
+-- IndexScan: False
+select id[,dist(embedding, [...])] from tbl where comment match_any 'olap' and dist(embedding, [...]) < 10 AND array_size(embedding) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+```
+
+### 虚拟列优化公共子表达式
+
+Index Only Scan 主要解决 IO 问题,避免了对 embedding 的大量随机读。为进一步消除重复计算,Doris
在计算层引入“虚拟列”机制,将索引返回的 dist 以列形式传递给表达式执行器。
+虚拟列的设计要点:
+1. 引入表达式节点 `VirtualSlotRef`;
+2. 引入列迭代器 `VirtualColumnIterator`。
+
+`VirtualSlotRef` 表示“计算时生成”的特殊列,由某个表达式物化且可被多个表达式共享,仅首次使用时计算一次,从而消除 Projection
与谓词中的公共子表达式(CSE)重复计算。`VirtualColumnIterator`
用于将索引返回的距离物化到表达式,避免重复的距离函数计算。该机制最初用于 ANN 相关查询的 CSE 消除,随后扩展至通用的 Projection +
Scan + Filter 组合。基于 ClickBench 数据集,以下查询统计从 Google 获得最多点击的 20 个网站:
+```sql
+set experimental_enable_virtual_slot_for_cse=true;
+
+SELECT counterid,
+ COUNT(*) AS hit_count,
+ COUNT(DISTINCT userid) AS unique_users
+FROM hits
+WHERE ( UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) = 'GOOGLE.COM'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) =
'GOOGLE.RU'
+ OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) LIKE
'%GOOGLE%' )
+ AND ( LENGTH(regexp_extract(referer, '^https?://([^/]+)', 1)) > 3
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) != ''
+ OR regexp_extract(referer, '^https?://([^/]+)', 1) IS NOT NULL )
+ AND eventdate = '2013-07-15'
+GROUP BY counterid
+HAVING hit_count > 100
+ORDER BY hit_count DESC
+LIMIT 20;
+```
+核心表达式 `regexp_extract(referer, '^https?://([^/]+)', 1)` 为 CPU
密集型且被多处复用。启用虚拟列优化(`set experimental_enable_virtual_slot_for_cse=true;`)后:
+- 开启优化:0.57 s
+- 关闭优化:1.50 s
+
+端到端性能提升约 3 倍。
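
虚拟列消除公共子表达式的效果可以用如下示意代码理解(`VirtualColumn` 为示意实现,并非 Doris 源码):高开销表达式的结果被物化为一个共享列,谓词与 Projection 引用同一份结果,每行只计算一次。

```cpp
#include <functional>
#include <string>
#include <vector>

// 示意:把高开销表达式的结果物化为一个“虚拟列”,
// 多个引用处共享同一份结果,每行只计算一次。
struct VirtualColumn {
    std::vector<std::string> values;
    int eval_count = 0;  // 统计底层表达式实际求值次数

    // 惰性物化:首个使用者触发计算,其余使用者直接复用
    const std::vector<std::string>& materialize(
            const std::vector<std::string>& input,
            const std::function<std::string(const std::string&)>& expr) {
        if (values.empty() && !input.empty()) {
            for (const auto& v : input) {
                values.push_back(expr(v));
                ++eval_count;
            }
        }
        return values;
    }
};
```

对应到上面的查询:`regexp_extract(referer, ...)` 只按行求值一次,`UPPER`、`LENGTH` 等外层表达式全部复用该虚拟列,而不是各自重复执行正则抽取。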
+
+### Scan 并行度优化
+Doris 针对 Ann TopN Search 重构了 Scan 并行策略。原策略按“行数”决定并行度(默认 2,097,152 行对应 1 个 Scan
Task)。由于 segment 基于 size 创建,高维向量列会使单 segment 行数远低于该阈值,导致一个 Scan Task 内出现多个
segment 串行扫描、进而影响性能。Doris 改为“严格按 segment 创建 Scan Task”,提升索引分析阶段的并行度;由于 Ann TopN
Search 的过滤率极高(只返回 N 行),回表阶段即便串行也不影响整体性能。以 SIFT 1M 为例:开启 `set optimize_index_scan_parallelism=true;` 后,TopN 串行查询耗时由 230 ms 降至 50 ms。
+此外,4.0 引入“动态并行度调整”:每轮调度前根据 Scan
线程池压力动态决定可提交的任务数;压力大则减并行、资源空闲则增并行,以在串行与高并发场景间兼顾资源利用率与调度开销。
+### 全局 TopN 延迟物化
+典型的 Ann TopN 查询包含两阶段:
+1. Scan 算子通过索引获取各 segment 的 TopN 距离;
+2. 全局排序节点对各 segment 的 TopN 进行合并排序,得到最终 TopN。
+
+若 projection 返回多列或包含大列(如 String),阶段一从每个 segment 读取的 N 行可能造成大量磁盘
IO,且在阶段二的全局排序中被丢弃(非最终 TopN)。Doris 通过“全局 TopN 延迟物化”最大限度减少阶段一读取量。
+以 `SELECT id, l2_distance_approximate(embedding, [...]) AS dist FROM tbl ORDER
BY dist LIMIT 100;` 为例:阶段一每个 segment 通过 Ann Index Only Scan + 虚拟列仅输出 100 个
`dist` 及其 `rowid`;若共有 M 个 segment,阶段二对 `100 * M` 个 `dist` 做全局排序得到最终 TopN 及其
`rowid`,最后 Materialize 算子依据这些 `rowid` 在对应 tablet/rowset/segment 上物化所需列。
\ No newline at end of file
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
index 39b1da157d3..1bccdf35b66 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
@@ -1,6 +1,7 @@
---
{
- "title": "向量搜索概览",
+ "title": "向量搜索",
+ "sidebar_label": "概述",
"language": "zh-CN"
}
---
@@ -92,11 +93,11 @@ select count(*) from sift_1M
| 1000000 |
+----------+
```
-SIFT 数据集同时发布了一组 ground truth,用于校验结果。下面选取一组向量,先使用精确距离函数进行 TopN 召回:
+使用 `l2_distance_approximate` / `inner_product_approximate` 会触发 ANN
索引路径。函数名必须与索引的 `metric_type` 完全匹配(例如:`metric_type=l2_distance` → 使用
`l2_distance_approximate`;`metric_type=inner_product` → 使用
`inner_product_approximate`)。排序规则:L2 距离使用升序(越小越近);Inner Product 使用降序(越大越近)。
```sql
SELECT id,
- L2_distance(
+ l2_distance_approximate(
embedding,
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
) AS distance
@@ -119,41 +120,21 @@ LIMIT 10;
| 803100 | 230.9112 |
| 866737 | 231.6441 |
+--------+----------+
+10 rows in set (0.02 sec)
+```
+要与精确的真实结果进行比较,请使用 `l2_distance` 或 `inner_product`(不带 `_approximate`
后缀)。在此示例中,精确搜索耗时约 290 毫秒:
+```
10 rows in set (0.29 sec)
```
-当使用 `l2_distance` 或 `inner_product` 时,Doris 需要计算查询向量与 1,000,000 个候选向量之间的距离,再通过
TopN 算子得到全局结果。使用 `l2_distance_approximate` / `inner_product_approximate`
可触发索引执行路径:
-```sql
-SELECT id,
- l2_distance_approximate(
- embedding,
-
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
- ) AS distance
-FROM sift_1m
-ORDER BY distance
-LIMIT 10;
---------------
+使用 ANN 索引后,查询延迟从约 290 毫秒降至约 20 毫秒。
-+--------+----------+
-| id | distance |
-+--------+----------+
-| 178811 | 210.1595 |
-| 177646 | 217.0161 |
-| 181997 | 218.5406 |
-| 181605 | 219.2989 |
-| 821938 | 221.7228 |
-| 807785 | 226.7135 |
-| 716433 | 227.3148 |
-| 358802 | 230.7314 |
-| 803100 | 230.9112 |
-| 866737 | 231.6441 |
-+--------+----------+
-10 rows in set (0.02 sec)
-```
-可以看到使用 ANN 索引后,查询耗时从约 290 ms 降至约 20 ms。
-Doris 中,ANN 索引建立在 segment 粒度;由于表是分布式的,各 segment 返回局部 TopN 后,TopN 算子会将多个 tablet
的结果归并生成全局 TopN。
+ANN 索引以 segment 为粒度构建。在分布式表中,每个 segment 返回其本地 TopN 结果;然后 TopN 算子在 tablet 和
segment 之间合并结果以产生全局 TopN。
+
+排序说明:
+- 对于 `metric_type = l2_distance`,距离越小表示向量越接近 → 使用 `ORDER BY dist ASC`。
+- 对于 `metric_type = inner_product`,数值越大表示向量越接近 → 使用 `ORDER BY dist DESC`
通过索引获取 TopN。
-需要注意:当 `l2_distance` 作为索引 metric 时,distance 越小表示越接近;`inner_product`
则相反,值越大越接近。因此若使用 `inner_product`,必须 `ORDER BY dist DESC` 才能通过索引获得 TopN。
## 近似范围搜索
除了常见的 TopN 最近邻搜索(即返回与目标向量最近的前 N 条记录)之外,向量检索中还有一类常见的查询方式是 基于距离阈值的范围搜索。
diff --git a/sidebars.ts b/sidebars.ts
index db5ef9a8caa..a1e93f377bd 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -414,6 +414,7 @@ const sidebars: SidebarsConfig = {
'ai/vector-search/ivf',
'ai/vector-search/index-management',
'ai/vector-search/performance',
+ 'ai/vector-search/behind-index',
],
},
],
diff --git a/static/images/vector-search/image-1.png
b/static/images/vector-search/image-1.png
new file mode 100644
index 00000000000..c48f66aac65
Binary files /dev/null and b/static/images/vector-search/image-1.png differ
diff --git a/static/images/vector-search/image-2.png
b/static/images/vector-search/image-2.png
new file mode 100644
index 00000000000..63b33c72caf
Binary files /dev/null and b/static/images/vector-search/image-2.png differ
diff --git a/static/images/vector-search/image-3.png
b/static/images/vector-search/image-3.png
new file mode 100644
index 00000000000..8920cef3cc2
Binary files /dev/null and b/static/images/vector-search/image-3.png differ
diff --git a/static/images/vector-search/image-4.png
b/static/images/vector-search/image-4.png
new file mode 100644
index 00000000000..10cdaca54a5
Binary files /dev/null and b/static/images/vector-search/image-4.png differ
diff --git a/static/images/vector-search/image.png
b/static/images/vector-search/image.png
new file mode 100644
index 00000000000..069a4c5099c
Binary files /dev/null and b/static/images/vector-search/image.png differ
diff --git a/versioned_docs/version-4.x/ai/vector-search/behind-index.md
b/versioned_docs/version-4.x/ai/vector-search/behind-index.md
new file mode 100644
index 00000000000..1dd1b0b3801
--- /dev/null
+++ b/versioned_docs/version-4.x/ai/vector-search/behind-index.md
@@ -0,0 +1,240 @@
+---
+{
+ "title": "Optimizations Behind Performance",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Early versions of Apache Doris focused on online analytical processing (OLAP),
primarily for reporting and aggregation workloads—typical queries being
multi-table JOIN and GROUP BY. In 2.x, Doris added text search via inverted
indexes and introduced the Variant type for efficient JSON handling. In 3.x,
storage-compute separation enabled leveraging object storage to significantly
reduce storage costs. In 4.x, Doris steps into the AI era by introducing vector
indexes and hybrid search (vec [...]
+
+We divide vector indexing into two stages: indexing and querying. The indexing
stage focuses on 1) data sharding, 2) efficiently building high-quality
indexes, and 3) index management. The querying stage has a single goal: improve
query performance—eliminating redundant computation and unnecessary IO while
optimizing concurrency.
+
+## Indexing Stage
+Indexing performance is strongly tied to index hyperparameters: higher index
quality typically means longer build time. Thanks to optimizations in the
ingestion path, Doris can maintain high index quality while improving ingestion
throughput.
+
+On the 768D 10M dataset, Apache Doris achieves industry-leading ingestion
performance.
+
+
+
+### Multi-Level Sharding
+Internal tables in Apache Doris are inherently distributed. During query and
ingestion, users interact with a single logical table, while the Doris kernel
creates the required number of physical tablets based on the table definition.
During ingestion, data is routed to the appropriate BE tablet by partition and
bucket keys. Multiple tablets together form the logical table seen by users.
Each ingestion request forms a transaction, creating a rowset (versioning unit)
on the corresponding t [...]
+
+
+
+Vector indexes (e.g., HNSW) rely on key hyperparameters that directly
determine index quality and query performance, and are typically tuned for
specific data scales. Apache Doris’s multi-level sharding decouples “index
parameters” from the “full table data scale”: users need not rebuild indexes as
total data grows, but only tune parameters based on per-batch ingestion size.
From our tests, HNSW suggested parameters under different batch sizes are:
+
+| batch_size | max_degree | ef_construction | ef_search | recall@100 |
+|------------|------------|-----------------|-----------|------------|
+| 250000 | 100 | 200 | 50 | 89% |
+| 250000 | 100 | 200 | 100 | 93% |
+| 250000 | 100 | 200 | 150 | 95% |
+| 250000 | 100 | 200 | 200 | 98% |
+| 500000 | 120 | 240 | 50 | 91% |
+| 500000 | 120 | 240 | 100 | 94% |
+| 500000 | 120 | 240 | 150 | 96% |
+| 500000 | 120 | 240 | 200 | 99% |
+| 1000000 | 150 | 300 | 50 | 90% |
+| 1000000 | 150 | 300 | 100 | 93% |
+| 1000000 | 150 | 300 | 150 | 96% |
+| 1000000 | 150 | 300 | 200 | 98% |
+
+In short, focus on “per-batch ingestion size” and choose proper index
parameters to maintain quality and stable query behavior.
+
+### High-Performance Index Building
+
+#### Parallel, High-Quality Index Construction
+
+Doris accelerates index builds with two-level parallelism: cluster-level
parallelism across BE nodes, and intra-node multithreaded distance computation
on grouped batch data. Beyond speed, Doris improves index quality via in-memory
batching: when the total vector count is fixed but batching is too fine
(frequent incremental builds), graph structures become sparser and recall
drops. For example, on 768D10M, building in 10 batches may reach ~99% recall,
while 100 batches may drop to ~95%. [...]
+
+#### SIMD
+
+The core cost in ANN index building is large-scale distance computation—a
CPU-bound workload. Doris centralizes this work on BE nodes, implemented in
C++, and leverages Faiss’s automatic and manual vectorization optimizations.
For L2 distance, Faiss uses compiler pragmas to trigger auto-vectorization:
+```cpp
+FAISS_PRAGMA_IMPRECISE_FUNCTION_BEGIN
+float fvec_L2sqr(const float* x, const float* y, size_t d) {
+    size_t i;
+    float res = 0;
+ FAISS_PRAGMA_IMPRECISE_LOOP
+ for (i = 0; i < d; i++) {
+ const float tmp = x[i] - y[i];
+ res += tmp * tmp;
+ }
+ return res;
+}
+FAISS_PRAGMA_IMPRECISE_FUNCTION_END
+```
+With `FAISS_PRAGMA_IMPRECISE_*`, compilers auto-vectorize:
+```cpp
+#define FAISS_PRAGMA_IMPRECISE_LOOP \
+ _Pragma("clang loop vectorize(enable) interleave(enable)")
+```
+Faiss also applies explicit SIMD in `#ifdef SSE3/AVX2/AVX512F` blocks using
`_mm*`/`_mm256*`/`_mm512*`, combined with `ElementOpL2/ElementOpIP` and
dimension-specialized `fvec_op_ny_D{1,2,4,8,12}` to:
+- Process multiple samples per iteration (e.g., 8/16) and perform
register-level transpose to improve memory access locality;
+- Use FMA (e.g., `_mm512_fmadd_ps`) to fuse multiply-add and reduce
instruction count;
+- Do horizontal sums to produce scalars efficiently;
+- Handle tail elements via masked reads for non-aligned sizes.
+These optimizations reduce instruction and memory costs and significantly
boost indexing throughput.
+
+## Querying Stage
+
+Search is latency sensitive. At tens of millions of records with high
concurrency, P99 latency typically needs to be under 500 ms—raising the bar for
the optimizer, execution engine, and index implementation. Out-of-the-box tests
show Doris reaches performance comparable to mainstream dedicated vector
databases. The chart below compares Doris against other systems on
Performance768D10M; peer data comes from Zilliz’s open-source
[VectorDBBench](https://github.com/zilliztech/VectorDBBench).
+
+
+
+> Note: The chart includes a subset of out-of-the-box results. OpenSearch and
Elastic Cloud can improve query performance by optimizing the number of index
files.
+
+### Prepare Statement
+In the traditional path, Doris runs full optimization (parsing, semantic
analysis, RBO, CBO) for every SQL. While essential for general OLAP, this adds
overhead for simple, highly repetitive search queries. Doris 4.0 extends
Prepare Statement beyond point lookups to all SQL types, including vector
search:
+1. Separate compile and execute
+ - Prepare performs parsing, semantics, and optimization once, producing a
reusable Logical Plan.
+ - Execute binds parameters at runtime and runs the pre-built plan, skipping
the optimizer entirely.
+2. Plan cache
+ - Reuse is determined by SQL fingerprint (normalized SQL + schema version).
+ - Different parameter values with the same structure reuse the cached plan,
avoiding re-optimization.
+3. Schema version check
+ - Validate schema version at execution to ensure correctness.
+ - No change → reuse; changed → invalidate and re-prepare.
+4. Speedup by skipping optimizer
+ - Execute no longer runs RBO/CBO; optimizer time is nearly eliminated.
+ - Template-heavy vector queries benefit with significantly lower end-to-end
latency.
+
+### Index Only Scan
+Doris implements vector indexes as external (pluggable) indexes, which
simplifies management and supports asynchronous builds, but introduces
performance challenges such as avoiding redundant computation and IO. ANN
indexes can return distances in addition to row IDs. Doris leverages this by
short-circuiting distance expressions within the Scan operator via “virtual
columns,” and the Ann Index Only Scan fully eliminates distance-related read IO.
+In the naive flow, Scan pushes predicates to the index, the index returns row
IDs, and Scan then reads data pages and computes expressions before returning N
rows upstream.
+
+
+
+With Index Only Scan applied, the flow becomes:
+
+
+
+For example, `SELECT l2_distance_approximate(embedding, [...]) AS dist FROM
tbl ORDER BY dist LIMIT 100;` executes without touching data files.
+
+Beyond Ann TopN Search, Range Search and Compound Search adopt similar
optimizations. Range Search is more nuanced: whether the index returns `dist`
depends on the comparator. Below lists query types related to Ann Index Only
Scan and whether Index Scan applies:
+
+```SQL
+-- Sql1: Range + proj
+-- Index returns dist; no need to recompute dist
+-- Virtual column for CSE avoids dist recomputation in proj
+-- IndexScan: True
+select id, dist(embedding, [...]) from tbl where dist <= 10;
+
+-- Sql2: Range + no-proj
+-- Index returns dist; no need to recompute
+-- IndexScan: True
+select id from tbl where dist <= 10 order by id limit N;
+
+-- Sql3: Range + proj + no-dist-from index
+-- Index cannot return dist (only updates rowid map)
+-- proj requires dist → embedding must be reread
+-- IndexScan: False
+select id, dist(embedding, [...]) from tbl where dist > 10;
+
+-- Sql4: Range + proj + no-dist-from index
+-- Index cannot return dist, but proj does not need dist → embedding not reread
+-- IndexScan: True
+select id from tbl where dist > 10;
+
+-- Sql5: TopN
+-- Index returns dist; virtual slot for CSE uploads dist to proj
+-- embedding column not read
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl order by dist(embedding, [...])
asc limit N;
+
+-- Sql6: TopN + IndexFilter
+-- 1) comment not read (inverted index already optimizes this)
+-- 2) embedding not read (same reason as Sql5)
+-- IndexScan: True
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql7: TopN + Range
+-- IndexScan: True (combination of Sql1 and Sql5)
+select id[, dist(embedding, [...])] from tbl where dist(embedding, [...]) > 10
order by dist(embedding, [...]) limit N;
+
+-- Sql8: TopN + Range + IndexFilter
+-- IndexScan: True (combination of Sql7 and Sql6)
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) > 10 ORDER BY dist(embedding, [...]) LIMIT N;
+
+-- Sql9: TopN + Range + CommonFilter
+-- Key points: 1) dist < 10 (not > 10); 2) common filter reads dist, not
embedding
+-- Index returns dist; virtual slot for CSE ensures all reads refer to the
same column
+-- In theory embedding need not materialize; in practice it still does due to
residual predicates on the column
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) < 10 AND abs(dist(embedding) + 10) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+
+-- Sql10: Variant of Sql9, dist < 10 → dist > 10
+-- Index cannot return dist; to compute abs(dist(embedding) + 10), embedding
must be materialized
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) > 10 AND abs(dist(embedding) + 10) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+
+-- Sql11: Variant of Sql9, abs(dist(...)+10) > 10 → array_size(embedding) > 10
+-- array_size requires embedding materialization
+-- IndexScan: False
+select id[, dist(embedding, [...])] from tbl where comment match_any 'olap'
and dist(embedding, [...]) < 10 AND array_size(embedding) > 10 ORDER BY
dist(embedding, [...]) LIMIT N;
+```
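
The range-search decision in Sql1–Sql4 above condenses to a small predicate. The sketch below is illustrative only (the function name and flags are assumptions, not the actual Doris planner logic): the index can return `dist` only for "closest-side" comparators, and Index Only Scan fails only when the index cannot return `dist` yet the projection still needs it.

```cpp
// Illustrative decision for Sql1-Sql4 above (not actual planner code):
// - index_returns_dist:    true for range predicates like dist <= t
// - projection_needs_dist: true when proj outputs dist(embedding, [...])
// Index Only Scan applies unless dist must be recomputed from embedding.
inline bool index_only_scan_applies(bool index_returns_dist,
                                    bool projection_needs_dist) {
    return index_returns_dist || !projection_needs_dist;
}
```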
+
+### Virtual Columns for CSE
+
+Index Only Scan mainly eliminates IO (random reads of embedding). To further
remove redundant computation, Doris introduces virtual columns that pass
index-returned `dist` into the expression engine as a column.
+Design highlights:
+1. Expression node `VirtualSlotRef`
+2. Column iterator `VirtualColumnIterator`
+
+`VirtualSlotRef` is a compute-time-generated column: materialized by one
expression, reusable by many, computed once on first use—eliminating CSE across
Projection and predicates. `VirtualColumnIterator` materializes index-returned
distances into expressions, avoiding repeated distance calculations. Initially
built for ANN query CSE elimination, the mechanism was generalized to
Projection + Scan + Filter.
+Using the ClickBench dataset, the query below counts the top 20 websites by
Google clicks:
+```sql
+set experimental_enable_virtual_slot_for_cse=true;
+
+SELECT counterid,
+ COUNT(*) AS hit_count,
+ COUNT(DISTINCT userid) AS unique_users
+FROM hits
+WHERE ( UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) = 'GOOGLE.COM'
+        OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) = 'GOOGLE.RU'
+        OR UPPER(regexp_extract(referer, '^https?://([^/]+)', 1)) LIKE '%GOOGLE%' )
+  AND ( LENGTH(regexp_extract(referer, '^https?://([^/]+)', 1)) > 3
+        OR regexp_extract(referer, '^https?://([^/]+)', 1) != ''
+        OR regexp_extract(referer, '^https?://([^/]+)', 1) IS NOT NULL )
+ AND eventdate = '2013-07-15'
+GROUP BY counterid
+HAVING hit_count > 100
+ORDER BY hit_count DESC
+LIMIT 20;
+```
+The core expression `regexp_extract(referer, '^https?://([^/]+)', 1)` is
+CPU-intensive and reused across predicates. With virtual columns toggled via
+`set experimental_enable_virtual_slot_for_cse=true;`:
+- Enabled: 0.57 s
+- Disabled: 1.50 s
+
+End-to-end performance improves by ~3x.
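Conceptually, the virtual column behaves like a memoized, compute-once column shared by every consumer of the same expression. A hedged Python sketch of that idea (a toy row model and a hypothetical `VirtualColumn` class, not the actual `VirtualSlotRef` implementation):

```python
import re

class VirtualColumn:
    """Compute-once column: the first consumer materializes it, the rest reuse it."""
    def __init__(self, compute):
        self.compute = compute    # row -> value
        self.values = None        # materialized column, filled on first use
        self.evaluations = 0      # how many times the expression actually ran

    def get(self, rows):
        if self.values is None:
            self.evaluations += 1
            self.values = [self.compute(r) for r in rows]
        return self.values

rows = [{"referer": "https://google.com/a"}, {"referer": "https://doris.apache.org"}]
extract = lambda r: re.match(r"https?://([^/]+)", r["referer"]).group(1)
host = VirtualColumn(extract)  # shared by all three predicates below

p1 = [h.upper() == "GOOGLE.COM" for h in host.get(rows)]
p2 = ["GOOGLE" in h.upper() for h in host.get(rows)]
p3 = [len(h) > 3 for h in host.get(rows)]
# The regexp ran over the rows exactly once, not three times.
```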
+
+### Scan Parallelism Optimization
+Doris revamped Scan parallelism for Ann TopN Search. The original policy set
+parallelism by row count (default: 2,097,152 rows per Scan Task). Because
+segments are size-based, high-dimensional vector columns produce far fewer rows
+per segment, leading to multiple segments being scanned serially within one
+Scan Task. Doris switched to "one Scan Task per segment," boosting parallelism
+in index scanning; given Ann TopN's high filter rate (only N rows returned),
+the back-to-table phase can r [...]
+`set optimize_index_scan_parallelism=true;`
+TopN single-threaded query latency drops from 230 ms to 50 ms.
+Additionally, 4.0 introduces dynamic parallelism: before each scheduling round,
+Doris adjusts the number of submitted Scan tasks based on thread-pool pressure
+(fewer tasks under high load, more when idle) to balance resource use and
+scheduling overhead across serial and concurrent workloads.
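The effect of the policy switch can be sketched with simple arithmetic: a size-bounded segment of high-dimensional vectors holds few rows, so the row-count policy packs many segments into one serial task, while per-segment scheduling gives each segment its own task. A sketch with made-up workload numbers (only the 2,097,152-row default comes from the text above):

```python
import math

ROWS_PER_SCAN_TASK = 2_097_152  # default row-count threshold mentioned above

def tasks_row_based(total_rows):
    """Old policy: parallelism derived from row count."""
    return max(1, math.ceil(total_rows / ROWS_PER_SCAN_TASK))

def tasks_per_segment(num_segments):
    """New policy: one Scan Task per segment."""
    return num_segments

# Made-up workload: 1M high-dimensional vectors spread over 20 size-based segments.
total_rows, num_segments = 1_000_000, 20
old = tasks_row_based(total_rows)      # 1 task -> all 20 segments scanned serially
new = tasks_per_segment(num_segments)  # 20 tasks -> segments scanned in parallel
```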
+
+### Global TopN Delayed Materialization
+A typical Ann TopN query executes in two stages:
+1. Scan obtains per-segment TopN distances via the index;
+2. Global sort merges per-segment TopN to produce the final TopN.
+
+If the projection returns many columns or large types (e.g., String), reading N
+rows from each segment in stage 1 can incur heavy IO, and many of those rows
+are then discarded by the stage-2 global sort. Doris minimizes stage-1 IO via
+global TopN delayed materialization.
+For `SELECT id, l2_distance_approximate(embedding, [...]) AS dist FROM tbl
+ORDER BY dist LIMIT 100;`: stage 1 outputs only 100 `dist` values and rowids
+per segment via Ann Index Only Scan + virtual columns. With M segments, stage 2
+globally sorts the `100 * M` `dist` values to obtain the final TopN and their
+rowids; the Materialize operator then fetches the needed columns by rowid from
+the corresponding tablet/rowset/segment.
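The two stages plus delayed materialization can be sketched as follows, with a toy 1-D "vector" and absolute difference standing in for `l2_distance_approximate`, and in-memory dicts standing in for segment reads (all names hypothetical):

```python
import heapq

def stage1(segments, query, n):
    """Stage 1: each segment's index yields only (dist, (seg_id, rowid)) pairs."""
    out = []
    for seg_id, seg in enumerate(segments):
        dists = ((abs(v - query), (seg_id, rid)) for rid, v in seg["vec"].items())
        out.append(heapq.nsmallest(n, dists))  # local TopN, no wide columns read
    return out

def stage2(per_segment, n):
    """Stage 2: global sort over the N * M small (dist, rowid) pairs."""
    return heapq.nsmallest(n, heapq.merge(*per_segment))

def materialize(segments, topn):
    """Delayed materialization: fetch payload columns only for the final winners."""
    return [segments[s]["payload"][r] for _, (s, r) in topn]

segments = [
    {"vec": {0: 5.0, 1: 1.0}, "payload": {0: "row-a", 1: "row-b"}},
    {"vec": {0: 0.5, 1: 9.0}, "payload": {0: "row-c", 1: "row-d"}},
]
winners = materialize(segments, stage2(stage1(segments, 0.0, 2), 2))
```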
\ No newline at end of file
diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md
index 5ecfbcf5fb5..4a35528df09 100644
--- a/versioned_docs/version-4.x/ai/vector-search/overview.md
+++ b/versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -100,12 +100,11 @@ SELECT count(*) FROM sift_1M
| 1000000 |
+----------+
```
-
-The SIFT dataset ships with a ground-truth set for result validation. Pick one
-query vector and first run an exact Top-N using the precise distance:
+Using `l2_distance_approximate` / `inner_product_approximate` triggers the ANN
+index path. The function must match the index `metric_type` exactly (e.g.,
+`metric_type=l2_distance` → use `l2_distance_approximate`;
+`metric_type=inner_product` → use `inner_product_approximate`). For ordering:
+L2 uses ascending distance (smaller is closer); inner product uses descending
+score (larger is closer).
```sql
SELECT id,
- L2_distance(
+ l2_distance_approximate(
embedding,
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
) AS distance
@@ -128,44 +127,22 @@ LIMIT 10;
| 803100 | 230.9112 |
| 866737 | 231.6441 |
+--------+----------+
-10 rows in set (0.29 sec)
+10 rows in set (0.02 sec)
```
-When using `l2_distance` or `inner_product`, Doris computes the distance
-between the query vector and all 1,000,000 candidate vectors, then applies a
-TopN operator globally. Using `l2_distance_approximate` /
-`inner_product_approximate` triggers the index path:
-
-```sql
-SELECT id,
- l2_distance_approximate(
- embedding,
-          [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
- ) AS distance
-FROM sift_1m
-ORDER BY distance
-LIMIT 10;
---------------
+To compare with exact ground truth, use `l2_distance` or `inner_product`
+(without the `_approximate` suffix). In this example, the exact search takes
+~290 ms:
-+--------+----------+
-| id | distance |
-+--------+----------+
-| 178811 | 210.1595 |
-| 177646 | 217.0161 |
-| 181997 | 218.5406 |
-| 181605 | 219.2989 |
-| 821938 | 221.7228 |
-| 807785 | 226.7135 |
-| 716433 | 227.3148 |
-| 358802 | 230.7314 |
-| 803100 | 230.9112 |
-| 866737 | 231.6441 |
-+--------+----------+
-10 rows in set (0.02 sec)
+```
+10 rows in set (0.29 sec)
```
-With the ANN index, query latency in this example drops from about 290 ms to 20 ms.
+With the ANN index, query latency drops from ~290 ms to ~20 ms in this example.
-ANN indexes are built at the segment granularity. Because tables are
-distributed, after each segment returns its local TopN, the TopN operator
-merges results across tablets and segments to produce the global TopN.
+ANN indexes are built at segment granularity. In distributed tables, each
+segment returns its local TopN; the TopN operator then merges results across
+tablets and segments to produce the global TopN.
-Note: When `metric_type = l2_distance`, a smaller distance means closer
-vectors. For `inner_product`, a larger value means closer vectors. Therefore,
-if using `inner_product`, you must use `ORDER BY dist DESC` to obtain TopN via
-the index.
+Note on ordering:
+- For `metric_type = l2_distance`, a smaller distance means closer vectors → use `ORDER BY dist ASC`.
+- For `metric_type = inner_product`, a larger value means closer vectors → use `ORDER BY dist DESC` to obtain TopN via the index.
## Approximate Range Search
diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json
index 595eef9e046..0bcc4cd197e 100644
--- a/versioned_sidebars/version-4.x-sidebars.json
+++ b/versioned_sidebars/version-4.x-sidebars.json
@@ -420,7 +420,8 @@
"ai/vector-search/hnsw",
"ai/vector-search/ivf",
"ai/vector-search/index-management",
- "ai/vector-search/performance"
+ "ai/vector-search/performance",
+ "ai/vector-search/behind-index"
]
}
]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]