(doris-website) branch master updated: [lakehouse] add lakehouse overview (#2043)

morningman Mon, 17 Feb 2025 04:28:01 -0800

This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new c1b13f35b88 [lakehouse] add lakehouse overview (#2043)
c1b13f35b88 is described below

commit c1b13f35b88514b5fe492a78eefc0979f41a64ad
Author: Mingyu Chen (Rayner) <morning...@163.com>
AuthorDate: Mon Feb 17 20:27:48 2025 +0800

    [lakehouse] add lakehouse overview (#2043)
    
    ## Versions
    
    - [x] dev
    - [ ] 3.0
    - [ ] 2.1
    - [ ] 2.0
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [x] Checked by AI
    - [ ] Test Cases Built
---
 docs/lakehouse/lakehouse-overview.md               | 150 +++++++++++++++++++-
 .../current/lakehouse/lakehouse-overview.md        | 151 ++++++++++++++++++++-
 .../images/Lakehouse/compute-storage-decouple.png  | Bin 0 -> 430809 bytes
 static/images/Lakehouse/data-management.png        | Bin 0 -> 38700 bytes
 static/images/Lakehouse/federation-query.png       | Bin 0 -> 56776 bytes
 static/images/Lakehouse/lakehouse-arch-1.png       | Bin 0 -> 288586 bytes
 static/images/Lakehouse/performance.png            | Bin 0 -> 82113 bytes
 static/images/Lakehouse/query-acceleration.png     | Bin 0 -> 53713 bytes
 static/images/Lakehouse/tpcds1000.png              | Bin 0 -> 44750 bytes
 9 files changed, 299 insertions(+), 2 deletions(-)

diff --git a/docs/lakehouse/lakehouse-overview.md b/docs/lakehouse/lakehouse-overview.md
index 1a7ee83bda8..abadb9bd0b2 100644
--- a/docs/lakehouse/lakehouse-overview.md
+++ b/docs/lakehouse/lakehouse-overview.md
@@ -24,5 +24,153 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-The document is under development, please refer to versioned doc 2.1 or 3.0
+**The lakehouse is a modern big data solution that combines the advantages of 
data lakes and data warehouses**. It integrates the low cost and high 
scalability of data lakes with the high performance and strong data governance 
capabilities of data warehouses, enabling efficient, secure, and 
quality-controlled storage, processing, and analysis of diverse data in the big 
data era. Through standardized open data formats and metadata management, it 
unifies **real-time** and **historical** data [...]
 
+## Doris Lakehouse Solution
+
+Doris provides an excellent lakehouse solution for users through an extensible 
connector framework, a compute-storage decoupled architecture, a 
high-performance data processing engine, and data ecosystem openness.
+
+![doris-lakehouse-arch](/images/Lakehouse/lakehouse-arch-1.png)
+
+### Flexible Data Access
+
+Doris supports mainstream data systems and data format access through an 
extensible connector framework and provides unified data analysis capabilities 
based on SQL, allowing users to easily perform cross-platform data queries and 
analysis without moving existing data. For details, refer to the [Catalog 
Overview](./catalog-overview.md).
+
+### Data Source Connectors
+
+Whether it's Hive, Iceberg, Hudi, Paimon, or database systems supporting the 
JDBC protocol, Doris can easily connect and efficiently access data.
+
+For lakehouse systems, Doris can obtain the structure and distribution 
information of data tables from metadata services such as Hive Metastore, AWS 
Glue, and Unity Catalog, perform reasonable query planning, and utilize the MPP 
architecture for distributed computing.
+
+For details, refer to the document for each catalog, such as [Iceberg 
Catalog](./catalogs/iceberg-catalog.md).
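+
+As a minimal sketch of how such connections are declared, the statements below create a Hive catalog and a MySQL catalog over JDBC. The catalog names, addresses, credentials, and driver file name are placeholders chosen for illustration; the authoritative property list is in each catalog document.
+
+```sql
+-- Hive catalog backed by a Hive Metastore (the URI is a placeholder).
+CREATE CATALOG hive PROPERTIES (
+    'type' = 'hms',
+    'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
+);
+
+-- MySQL catalog over JDBC (user, password, URL, and driver are placeholders).
+CREATE CATALOG mysql PROPERTIES (
+    'type' = 'jdbc',
+    'user' = 'doris_reader',
+    'password' = 'change_me',
+    'jdbc_url' = 'jdbc:mysql://127.0.0.1:3306',
+    'driver_url' = 'mysql-connector-j-8.3.0.jar',
+    'driver_class' = 'com.mysql.cj.jdbc.Driver'
+);
+
+-- List the catalogs, then browse one of them.
+SHOW CATALOGS;
+SWITCH hive;
+SHOW DATABASES;
+```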
+
+#### Extensible Connector Framework
+
+Doris provides an extensible framework that helps developers quickly connect 
to data sources unique to their enterprise, enabling fast data 
interoperability.
+
+Doris defines three standard levels, Catalog, Database, and Table, so 
developers can easily map them to the hierarchy of the target data source. 
Doris also provides standard interfaces for accessing metadata services and 
storage services; developers only need to implement the corresponding 
interfaces to complete the data source connection.
+
+Doris is compatible with the Trino Connector plugin, allowing the Trino plugin 
package to be directly deployed to the Doris cluster, and with minimal 
configuration, the corresponding data source can be accessed. Doris has already 
completed connections to data sources such as 
[Kudu](./catalogs/kudu-catalog.md), [BigQuery](./catalogs/bigquery-catalog.md), 
and [Delta Lake](./catalogs/delta-lake-catalog.md). You can also [adapt new 
plugins yourself](https://doris.apache.org/community/how-to- [...]
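+
+For illustration, a Trino-connector-based catalog might be declared roughly as follows. This is a sketch under assumptions: the catalog name and master address are placeholders, the Trino Kudu plugin is assumed to be already deployed to the cluster, and the property names should be checked against the Kudu catalog document before use.
+
+```sql
+-- Assumes the Trino Kudu plugin has already been deployed to the Doris cluster.
+CREATE CATALOG kudu_catalog PROPERTIES (
+    'type' = 'trino-connector',
+    'trino.connector.name' = 'kudu',
+    -- Placeholder address of the Kudu master.
+    'trino.kudu.client.master-addresses' = '127.0.0.1:7051'
+);
+```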
+
+#### Convenient Cross-Source Data Processing
+
+Doris supports creating multiple data catalogs at runtime and using SQL to 
perform federated queries across these data sources. For example, users can 
join fact table data in Hive with dimension table data in MySQL:
+
+```sql
+SELECT h.id, m.name
+FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
+ON h.id = m.id;
+```
+
+Combined with Doris's built-in [job 
scheduling](../admin-manual/workload-management/job-scheduler.md) capabilities, 
you can also create scheduled tasks to further reduce system complexity. For 
example, users can schedule the above query as a routine job that runs every 
hour and writes each result into an Iceberg table:
+
+```sql
+CREATE JOB schedule_load
+ON SCHEDULE EVERY 1 HOUR DO
+INSERT INTO iceberg.db.ice_table
+SELECT h.id, m.name
+FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
+ON h.id = m.id;
+```
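+
+Once such a job exists, it can be inspected and controlled with SQL as well. The statements below are a sketch that reuses the `schedule_load` job name from the example above; the exact set of management statements and result columns is described in the Job Scheduler document.
+
+```sql
+-- List insert-type jobs and the task runs they have produced.
+SELECT * FROM jobs("type" = "insert");
+SELECT * FROM tasks("type" = "insert");
+
+-- Temporarily stop the job, resume it, or remove it entirely.
+PAUSE JOB WHERE jobName = 'schedule_load';
+RESUME JOB WHERE jobName = 'schedule_load';
+DROP JOB WHERE jobName = 'schedule_load';
+```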
+
+### High-Performance Data Processing
+
+As an analytical data warehouse, Doris has made numerous optimizations in 
lakehouse data processing and computation and provides rich query acceleration 
features:
+
+* Execution Engine
+
+    The Doris execution engine is based on the MPP execution framework and 
Pipeline data processing model, capable of quickly processing massive data in a 
multi-machine, multi-core distributed environment. Thanks to fully vectorized 
execution operators, Doris delivers leading compute performance on standard 
benchmarks such as TPC-DS.
+
+* Query Optimizer
+
+    Doris can automatically optimize and process complex SQL requests through 
the query optimizer. The query optimizer deeply optimizes various complex SQL 
operators such as multi-table joins, aggregation, sorting, and pagination, 
fully utilizing cost models and relational algebra transformations to 
automatically obtain better or optimal logical and physical execution plans, 
greatly reducing the difficulty of writing SQL and improving usability and 
performance.
+
+* Data Cache and IO Optimization
+
+    Access to external data sources is usually network access, which can have 
high latency and poor stability. Apache Doris provides rich caching mechanisms 
and has made numerous optimizations in cache types, timeliness, and strategies, 
fully utilizing memory and local high-speed disks to enhance the analysis 
performance of hot data. Additionally, Doris has made targeted optimizations 
for network IO characteristics such as high throughput, low IOPS, and high 
latency, providing external d [...]
+
+* Materialized Views and Transparent Acceleration
+
+    Doris provides rich materialized view update strategies, supporting full 
and partition-level incremental refresh to reduce construction costs and 
improve timeliness. In addition to manual refresh, Doris also supports 
scheduled refresh and data-driven refresh, further reducing maintenance costs 
and improving data consistency. Materialized views also have transparent 
acceleration capabilities, allowing the query optimizer to automatically route 
to appropriate materialized views for sea [...]
+
+As shown below, on a 1 TB TPC-DS standard test set based on the Iceberg table 
format, Doris's total execution time for the 99 queries is only 1/3 of Trino's.
+
+![doris-tpcds](/images/Lakehouse/tpcds1000.png)
+
+In actual user scenarios, Doris reduces average query latency by 20% and 95th 
percentile latency by 50% compared to Presto while using half the resources, 
significantly reducing resource costs while enhancing user experience.
+
+![doris-performance](/images/Lakehouse/performance.png)
+
+### Convenient Service Migration
+
+In the process of integrating multiple data sources and achieving lakehouse 
transformation, migrating SQL queries to Doris is a challenge due to 
differences in SQL dialects across systems in terms of syntax and function 
support. Without a suitable migration plan, the business side may need to make 
significant modifications to adapt to the new system's SQL syntax.
+
+To address this issue, Doris provides a [SQL Dialect Conversion 
Service](sql-convertor/sql-convertor-overview.md), allowing users to directly 
use SQL dialects from other systems for data queries. The conversion service 
converts these SQL dialects into Doris SQL, greatly reducing user migration 
costs. Currently, Doris supports SQL dialect conversion for common query 
engines such as Presto/Trino, Hive, PostgreSQL, and ClickHouse, achieving a 
compatibility of over 99% in some actual user sc [...]
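+
+As a rough sketch of what this looks like in a session, the example below switches the dialect and then issues a Presto-style query. It assumes the conversion service has already been deployed and configured as described in the linked document; the `user_id` column is hypothetical.
+
+```sql
+-- Interpret subsequent statements in this session as Presto/Trino SQL.
+SET sql_dialect = 'presto';
+
+-- A Presto-style query (approx_distinct is a Presto function; the column is hypothetical).
+-- The conversion service is expected to rewrite it into Doris SQL before execution.
+SELECT approx_distinct(user_id) FROM hive.db.hive_table;
+
+-- Switch back to native Doris SQL.
+SET sql_dialect = 'doris';
+```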
+
+### Modern Deployment Architecture
+
+Since version 3.0, Doris supports a cloud-native [compute-storage separation 
architecture](../compute-storage-decoupled/overview.md). This architecture, 
with its low cost and high elasticity, effectively improves resource 
utilization and enables independent scaling of compute and storage.
+
+![compute-storage-decouple](/images/Lakehouse/compute-storage-decouple.png)
+
+The above diagram shows Doris's compute-storage separation architecture, which 
decouples compute from storage. Compute nodes no longer store 
primary data, and the underlying shared storage layer (HDFS and object storage) 
serves as the unified primary data storage space, supporting independent 
scaling of compute and storage resources. The compute-storage separation 
architecture brings significant advantages to the lakehouse solution:
+
+* **Low-Cost Storage**: Storage and compute resources can be independently 
scaled, allowing enterprises to increase storage capacity without increasing 
compute resources. Additionally, by using cloud object storage, enterprises can 
enjoy lower storage costs and higher availability, while still using local 
high-speed disks for caching relatively low-proportion hot data.
+
+* **Single Source of Truth**: All data is stored in a unified storage layer, 
allowing the same data to be accessed and processed by different compute 
clusters, ensuring data consistency and integrity, and reducing the complexity 
of data synchronization and duplicate storage.
+
+* **Workload Diversity**: Users can dynamically allocate compute resources 
based on different workload needs, supporting various application scenarios 
such as batch processing, real-time analysis, and machine learning. By 
separating storage and compute, enterprises can more flexibly optimize resource 
usage, ensuring efficient operation under different loads.
+
+In addition, under the compute-storage coupled architecture, [elastic 
computing nodes](./compute-node.md) can still be used to provide elastic 
computing capabilities in lakehouse data query scenarios.
+
+### Openness
+
+Doris not only supports access to open lake table formats but also has good 
openness for its own stored data. Doris provides an open storage API and 
[implements a high-speed data link based on the Arrow Flight SQL 
protocol](../db-connect/arrow-flight-sql-connect.md), offering the speed 
advantages of Arrow Flight and the ease of use of JDBC/ODBC. Based on this 
interface, users can access data stored in Doris using ADBC clients for 
Python, Java, Spark, and Flink.
+
+Compared to open file formats, the open storage API abstracts the specific 
implementation of the underlying file format, allowing Doris to accelerate data 
access through advanced features in its storage format, such as rich indexing 
mechanisms. Additionally, upper-layer compute engines do not need to adapt to 
changes or new features in the underlying storage format, allowing all 
supported compute engines to simultaneously benefit from new features.
+
+## Lakehouse Best Practices
+
+In the lakehouse solution, Doris is mainly used for **lakehouse query 
acceleration**, **multi-source federated analysis**, and **lakehouse data 
processing**.
+
+### Lakehouse Query Acceleration
+
+In this scenario, Doris acts as a **compute engine**, accelerating query 
analysis on lakehouse data.
+
+![query-acceleration](/images/Lakehouse/query-acceleration.png)
+
+#### Cache Acceleration
+
+For lakehouse systems like Hive and Iceberg, users can configure local disk 
caching. Local disk caching automatically stores the data files involved in 
queries in local cache directories and manages cache eviction using an LRU 
strategy. For details, refer to the [Data Cache](./data-cache.md) document.
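+
+A minimal sketch of using the cache from a session is shown below. It assumes the file cache has also been enabled on the BE nodes and a cache path configured there, as described in the Data Cache document; the table is the hypothetical one used elsewhere on this page.
+
+```sql
+-- Enable the file cache for the current session.
+SET enable_file_cache = true;
+
+-- The first run reads from remote storage and can populate the local cache;
+-- later runs over the same data files can then be served from local disk.
+SELECT COUNT(*) FROM hive.db.hive_table;
+```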
+
+#### Materialized Views and Transparent Rewrite
+
+Doris supports creating materialized views for external data sources. 
Materialized views store pre-computed results as Doris internal table formats 
based on SQL definition statements. Additionally, Doris's query optimizer 
supports a transparent rewrite algorithm based on the SPJG 
(SELECT-PROJECT-JOIN-GROUP-BY) pattern. This algorithm can analyze the 
structure information of SQL, automatically find suitable materialized views 
for transparent rewrite, and select the optimal materialized vi [...]
+
+This feature can significantly improve query performance by reducing runtime 
computation. It also allows access to data in materialized views through 
transparent rewrite without business awareness. For details, refer to the 
[Materialized 
Views](../query-acceleration/materialized-view/async-materialized-view/overview.md)
 document.
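+
+The sketch below shows one way this could look for the Hive table used earlier. The database name `demo_db`, the view name, the bucket count, and the refresh interval are illustrative assumptions; transparent rewrite over external tables may also need to be enabled explicitly, so consult the linked document for the current options.
+
+```sql
+-- Materialized views are stored as Doris internal tables, so create one in the internal catalog.
+SWITCH internal;
+CREATE DATABASE IF NOT EXISTS demo_db;
+USE demo_db;
+
+-- Pre-aggregate the external Hive table and refresh the result every hour.
+CREATE MATERIALIZED VIEW mv_hive_id_count
+BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
+DISTRIBUTED BY RANDOM BUCKETS 2
+PROPERTIES ('replication_num' = '1')
+AS
+SELECT id, COUNT(*) AS cnt
+FROM hive.db.hive_table
+GROUP BY id;
+
+-- Trigger a refresh manually when needed.
+REFRESH MATERIALIZED VIEW mv_hive_id_count AUTO;
+
+-- A query of the same SPJG shape is a candidate for transparent rewrite.
+SELECT id, COUNT(*) FROM hive.db.hive_table GROUP BY id;
+```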
+
+### Multi-Source Federated Analysis
+
+Doris can act as a **unified SQL query engine**, connecting different data 
sources for federated analysis and eliminating data silos.
+
+![federation-query](/images/Lakehouse/federation-query.png)
+
+Users can dynamically create multiple catalogs in Doris to connect different 
data sources. They can use SQL statements to perform arbitrary join queries on 
data from different data sources. For details, refer to the [Catalog 
Overview](catalog-overview.md).
+
+### Lakehouse Data Processing
+
+In this scenario, **Doris acts as a data processing engine**, processing 
lakehouse data.
+
+![data-management](/images/Lakehouse/data-management.png)
+
+#### Task Scheduling
+
+Doris introduces the Job Scheduler feature, enabling efficient and flexible 
task scheduling, reducing dependency on external systems. Combined with data 
source connectors, users can achieve periodic processing and storage of 
external data. For details, refer to the [Job 
Scheduler](../admin-manual/workload-management/job-scheduler.md).
+
+#### Data Modeling
+
+Users typically use data lakes to store raw data and perform layered data 
processing on this basis, making different layers of data available to 
different business needs. Doris's materialized view feature supports creating 
materialized views for external data sources and supports further processing 
based on materialized views, reducing system complexity and improving data 
processing efficiency.
+
+#### Data Write-Back
+
+The data write-back feature closes the loop on Doris's lakehouse data 
processing capabilities. Users can directly create databases and tables in 
external data sources through Doris and write data. Currently, JDBC, Hive, and 
Iceberg data sources are supported, with more data sources to be added in the 
future. For details, refer to the documentation of the corresponding data 
source.
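+
+As a hedged illustration of write-back to Iceberg, the statements below create a database and a table in the `iceberg` catalog and fill the table from the federated query shown earlier. The database name, table name, and column types are assumptions made for the example.
+
+```sql
+-- Create a database and a table directly in the external Iceberg catalog.
+SWITCH iceberg;
+CREATE DATABASE IF NOT EXISTS demo_db;
+
+CREATE TABLE demo_db.id_name (
+    id INT,
+    name STRING
+) ENGINE = iceberg;
+
+-- Write the result of a cross-source join back into the Iceberg table.
+INSERT INTO iceberg.demo_db.id_name
+SELECT h.id, m.name
+FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
+ON h.id = m.id;
+```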
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/lakehouse-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/lakehouse-overview.md
index 78b437aab2f..f2d1d45ff9b 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/lakehouse-overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/lakehouse-overview.md
@@ -24,5 +24,154 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-文章更新中，请先参阅 2.1/3.0 版本文档。
+**湖仓一体是将数据湖和数据仓库的优势相结合的现代化大数据解决方案**。其融合了数据湖的低成本、高扩展性与数据仓库的高性能、强数据治理能力，从而实现对大数据时代各类数据的高效、安全、质量可控的存储和处理分析。同时通过标准化的数据格式和元数据管理，统一了实时、历史数据，批处理和流处理，正在逐步成为企业大数据解决方案新的标准。
+
+## Doris 湖仓一体解决方案
+
+Doris 通过可扩展的连接器框架、存算分离架构、高性能的数据处理引擎和数据生态开放性，为用户提供了优秀的湖仓一体解决方案。
+
+![doris-lakehouse-arch](/images/Lakehouse/lakehouse-arch-1.png)
+
+### 灵活的数据接入
+
+Doris 通过可扩展的连接器框架，支持主流数据系统和数据格式接入，并提供基于 SQL 
的统一数据分析能力，用户能够在不移动现有数据的情况下，轻松实现跨平台的数据查询与分析。具体可参阅 [数据目录概述](./catalog-overview.md)
+
+### 数据源连接器
+
+无论是 Hive、Iceberg、Hudi、Paimon，还是支持 JDBC 协议的数据库系统，Doris 均能轻松连接并高效访问数据。
+
+对于湖仓系统，Doris 可从元数据服务，如 Hive Metastore，AWS Glue、Unity Catalog 
中获取数据表的结构和分布信息，进行合理的查询规划，并利用 MPP 架构进行分布式计算。
+
+具体可参阅各数据目录文档，如 [Iceberg Catalog](./catalogs/iceberg-catalog.md)
+
+#### 可扩展的连接器框架
+
+Doris 提供良好的扩展性框架，帮助开发人员快速对接企业内部特有的数据源，实现数据快速互通。
+
+Doris 
定义了标准的数据目录（Catalog）、数据库（Database）、数据表（Table）三个层级，开发人员可以方便的映射到所需对接的数据源层级。Doris 
同时提供标准的元数据服务和数据读取服务的接口，开发人员只需按照接口定义实现对应的访问逻辑，即可完成数据源的对接。
+
+Doris 兼容 Trino Connector 插件，可直接将 Trino 插件包部署到 Doris 集群，经过少量配置即可访问对应的数据源。Doris 
目前已经完成了 
[Kudu](./catalogs/kudu-catalog.md)、[BigQuery](./catalogs/bigquery-catalog.md)、[Delta
 Lake](./catalogs/delta-lake-catalog.md) 等数据源的对接。也可以 
[自行适配新的插件](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide)。
+
+#### 便捷的跨源数据处理
+
+Doris 支持在运行时直接创建多个数据源连接器，并使用 SQL 对这些数据源进行联邦查询。比如用户可以将 Hive 中的事实表数据与 MySQL 
中的维度表数据进行关联查询：
+
+```sql
+SELECT h.id, m.name
+FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
+ON h.id = m.id;
+```
+
+结合 Doris 内置的 [作业调度](../admin-manual/workload-management/job-scheduler.md) 
能力，还可以创建定时任务，进一步简化系统复杂度。比如用户可以将上述查询的结果，设定为每小时执行一次的例行任务，并将每次的结果，写入一张 Iceberg 表：
+
+```sql
+CREATE JOB schedule_load
+ON SCHEDULE EVERY 1 HOUR DO
+INSERT INTO iceberg.db.ice_table
+SELECT h.id, m.name
+FROM hive.db.hive_table h JOIN mysql.db.mysql_table m
+ON h.id = m.id;
+```
+
+### 高性能的数据处理
+
+Doris 作为分析型数据仓库，在湖仓数据处理和计算方面做了大量优化，并提供了丰富的查询加速功能：
+
+* 执行引擎
+
+    Doris 执行引擎基于 MPP 执行框架和 Pipeline 
数据处理模型，能够很好的在多机多核的分布式环境下快速处理海量数据。同时，得益于完全的向量化执行算子，在计算性能方面，Doris 在 TPC-DS 
等标准评测数据集中处于领先地位。
+
+* 查询优化器
+
+    Doris 能通过查询优化器自动优化和处理复杂的 SQL 请求。查询优化器针对多表关联、聚合、排序、分页等多种复杂 SQL 
算子进行了深度优化，充分利用代价模型和关系代数变化，自动获取较优或最优的逻辑执行计划和物理执行计划，极大降低用户编写 SQL 的难度，提升易用性和性能。
+
+* 缓存加速与 IO 优化
+
+    外部数据源的访问，通常是网络访问，因此存在延迟高、稳定性差等问题。Apache Doris 
提供了丰富的缓存机制，并在缓存的类型、时效性、策略方面都做了大量的优化，充分利用内存和本地高速磁盘，提升热点数据的分析性能。同时，针对网络 IO 高吞吐、低 
IOPS、高延迟的特性，Doris 也进行了针对性的优化，可以提供媲美本地数据的外部数据源访问性能。
+
+* 物化视图与透明加速
+
+    Doris 提供丰富的物化视图更新策略，支持全量和分区级别的增量刷新，以降低构建成本并提升时效性。除手动刷新外，Doris 
还支持定时刷新和数据驱动刷新，进一步降低维护成本并提高数据一致性。物化视图还具备透明加速功能，查询优化器能够自动路由到合适的物化视图，实现无缝查询加速。此外，Doris
 的物化视图采用高性能存储格式，通过列存、压缩和智能索引技术，提供高效的数据访问能力，能够作为数据缓存的替代方案，提升查询效率。
+
+如下所示，在基于 Iceberg 表格式的 1TB 的 TPCDS 标准测试集上，Doris 执行 99 个查询的总体运行仅为 Trino 的 1/3。
+
+![doris-tpcds](/images/Lakehouse/tpcds1000.png)
+
+实际用户场景中，Doris 在使用一半资源的情况下，相比 Presto 平均查询延迟降低了 20%，95 分位延迟更是降低 
50%。在提升用户体验的同时，极大降低了资源成本。
+
+![doris-performance](/images/Lakehouse/performance.png)
+
+### 便捷的业务迁移
+
+在企业整合多个数据源并实现湖仓一体转型的过程中，迁移业务的 SQL 查询到 Doris 是一项挑战，因为不同系统的 SQL 
方言在语法和函数支持上存在差异。若没有合适的迁移方案，业务侧可能需要进行大量改造以适应新系统的 SQL 语法。
+
+为了解决这个问题，Doris 提供了 [SQL 
方言转换服务](sql-convertor/sql-convertor-overview.md)，允许用户直接使用其他系统的 SQL 
方言进行数据查询。转换服务会将这些 SQL 方言转换为 Doris SQL，极大降低了用户的迁移成本。目前，Doris 支持 
Presto/Trino、Hive、PostgreSQL 和 ClickHouse 等常见查询引擎的 SQL 方言转换，在某些实际用户场景中，兼容率可达到 
99% 以上。
+
+### 现代化的部署架构
+
+自 3.0 版本以来，Doris 支持面向云原生的 
[存算分离架构](../compute-storage-decoupled/overview.md)。这一架构凭借低成本和高弹性的特点，能够有效提高资源利用率，实现计算和存储的独立扩展。
+
+![compute-storage-decouple](/images/Lakehouse/compute-storage-decouple.png)
+
+上图是 Doris 存算分离的系统架构，对计算与存储进行了解耦，计算节点不再存储主数据，底层共享存储层（HDFS 
与对象存储）作为统一的数据主存储空间，并支持计算资源和存储资源独立扩缩容。存算分离架构为湖仓一体解决方案带来了显著的优势：
+
+* 
**低成本存储**：存储和计算资源可独立扩展，企业可以根据需要增加存储容量而不必增加计算资源。同时，通过使用云上的对象存储，企业可以享受更低的存储成本和更高的可用性，对于比例相对较低的热点数据，依然可以使用本地高速磁盘进行缓存。
+
+* **唯一可信来源**：所有数据都存储在统一的存储层中，同一份数据供不同的计算集群访问和处理，确保数据的一致性和完整性，也减少数据同步和重复存储的复杂性。
+
+* 
**负载多样性**：可以根据不同的工作负载需求动态调配计算资源，支持批处理、实时分析和机器学习等多种应用场景。通过分离存储和计算，企业可以更灵活地优化资源使用，确保在不同负载下的高效运行。
+
+此外，在存算一体架构下，依然可以通过 [弹性计算节点](./compute-node.md) 在湖仓数据查询场景提供弹性计算能力。
+
+### 开放性
+
+Doris 不仅支持开放湖表格式的访问，其自身存储的数据同样拥有良好的开放性。Doris 提供了开放存储 API，并[基于 Arrow Flight SQL 
协议实现了高速数据链路](../db-connect/arrow-flight-sql-connect.md)，具备 Arrow Flight 的速度优势以及 
JDBC/ODBC 的易用性。基于该接口，用户可以使用 Python/Java/Spark/Flink 的 ADBC 客户端访问 Doris 中存储的数据。
+
+与开放文件格式相比，开放存储 API 屏蔽了底层的文件格式的具体实现，Doris 
可以通过自身存储格式中的高级特性，如丰富的索引机制来加速数据访问。同时，上层的计算引擎无需对底层存储格式的变更或新特性进行适配，所有支持的该协议的计算引擎都可以同步享受到新特性带来的收益。
+
+## 湖仓一体最佳实践
+
+Doris 在湖仓一体方案中，主要用于 **湖仓查询加速**、**多源联邦分析** 和 **湖仓数据处理**。
+
+### 湖仓查询加速
+
+在该场景中，Doris 作为 **计算引擎**，对湖仓中数据进行查询分析加速。
+
+![query-acceleration](/images/Lakehouse/query-acceleration.png)
+
+#### 缓存加速
+
+针对 Hive、Iceberg 等湖仓系统，用户可以配置本地磁盘缓存。本地磁盘缓存会自动将查询涉及的数据文件存储在本地缓存目录中，并使用 LRU 
策略管理缓存的汰换。具体可参阅 [数据缓存](./data-cache.md) 文档。
+
+#### 物化视图与透明改写
+
+Doris 支持对外部数据源创建物化视图。物化视图根据 SQL 定义语句，预先将计算结果存储为 Doris 内表格式。同时，Doris 的查询优化器支持基于 
SPJG（SELECT-PROJECT-JOIN-GROUP-BY）模式的透明改写算法。该算法能够分析 SQL 
的结构信息，自动寻找合适的物化视图进行透明改写，并选择最优的物化视图来响应查询 SQL。
+
+该功能通过减少运行时的计算量，可显著提升查询性能。同时可以在业务无感知的情况下，通过透明改写访问到物化视图中的数据。具体可参阅 
[物化视图](../query-acceleration/materialized-view/async-materialized-view/overview.md)
 文档。
+
+### 多源联邦分析
+
+Doris 可以作为 **统一 SQL 查询引擎**，连接不同数据源进行联邦分析，解决数据孤岛。
+
+![federation-query](/images/Lakehouse/federation-query.png)
+
+用户可以在 Doris 中动态创建多个 Catalog 连接不同的数据源。并使用 SQL 语句对不同数据源中的数据进行任意关联查询。具体可参阅 
[数据目录概述](catalog-overview.md)。
+
+### 湖仓数据处理
+
+在该场景中，**Doris 作为数据处理引擎**，对湖仓数据进行加工处理。
+
+![data-management](/images/Lakehouse/data-management.png)
+
+#### 定时任务调度
+
+Doris 通过引入 Job Scheduler 
功能，可以实现高效灵活的任务调度，减少了对外部系统的依赖。结合数据源连接器，用户可以实现外部数据的定期加工入库。具体可参阅 
[作业调度](../admin-manual/workload-management/job-scheduler.md)。
+
+#### 数据分层加工
+
+企业通常会使用数据湖存储原始数据，在此基础上进行数据分层加工，将不同层的数据开放给不同的业务需求方。Doris 
的物化视图功能支持对外部数据源创建物化视图，并支持基于物化视图再加工，降低了分层加工的系统复杂度，提升数据处理效率。
+
+#### 数据写回
+
+数据写回功能将 Doris 的湖仓数据处理能力形成闭环。用户可以直接通过 Doris 在外部数据源中创建数据库、表，并写入数据。当前支持 JDBC、Hive 
和 Iceberg 三类数据源，后续会增加更多的数据源支持。具体可以参阅对应数据源的文档。
 
diff --git a/static/images/Lakehouse/compute-storage-decouple.png 
b/static/images/Lakehouse/compute-storage-decouple.png
new file mode 100644
index 00000000000..94f6a7b1b69
Binary files /dev/null and 
b/static/images/Lakehouse/compute-storage-decouple.png differ
diff --git a/static/images/Lakehouse/data-management.png 
b/static/images/Lakehouse/data-management.png
new file mode 100644
index 00000000000..ebfc9755e52
Binary files /dev/null and b/static/images/Lakehouse/data-management.png differ
diff --git a/static/images/Lakehouse/federation-query.png 
b/static/images/Lakehouse/federation-query.png
new file mode 100644
index 00000000000..1a94a452286
Binary files /dev/null and b/static/images/Lakehouse/federation-query.png differ
diff --git a/static/images/Lakehouse/lakehouse-arch-1.png 
b/static/images/Lakehouse/lakehouse-arch-1.png
new file mode 100644
index 00000000000..aef267c7278
Binary files /dev/null and b/static/images/Lakehouse/lakehouse-arch-1.png differ
diff --git a/static/images/Lakehouse/performance.png 
b/static/images/Lakehouse/performance.png
new file mode 100644
index 00000000000..31d2137c708
Binary files /dev/null and b/static/images/Lakehouse/performance.png differ
diff --git a/static/images/Lakehouse/query-acceleration.png 
b/static/images/Lakehouse/query-acceleration.png
new file mode 100644
index 00000000000..a322a14006c
Binary files /dev/null and b/static/images/Lakehouse/query-acceleration.png 
differ
diff --git a/static/images/Lakehouse/tpcds1000.png 
b/static/images/Lakehouse/tpcds1000.png
new file mode 100644
index 00000000000..780a00de078
Binary files /dev/null and b/static/images/Lakehouse/tpcds1000.png differ


