This is an automated email from the ASF dual-hosted git repository. kassiez pushed a commit to branch KassieZ-patch-1 in repository https://gitbox.apache.org/repos/asf/doris.git
commit a7de0fccfb482d93e9580d3b43645acc9dff82bc Author: KassieZ <139741991+kass...@users.noreply.github.com> AuthorDate: Tue Mar 18 12:11:02 2025 +0800 Update README.md --- README.md | 119 ++++++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 78 insertions(+), 41 deletions(-) diff --git a/README.md b/README.md index d850df18001..3d8e4233a85 100644 --- a/README.md +++ b/README.md @@ -25,8 +25,8 @@ under the License. [](https://github.com/apache/doris/releases) [](https://ossrank.com/p/516) [](https://github.com/apache/doris/commits/master/) -[](https://doris.apache.org/docs/get-starting/quick-start) -[](https://doris.apache.org/zh-CN/docs/get-starting/quick-start/) +[](https://doris.apache.org/docs/gettingStarted/what-is-apache-doris) +[](https://doris.apache.org/zh-CN/docs/gettingStarted/what-is-apache-doris) <div> @@ -42,7 +42,7 @@ under the License. <a href="https://github.com/apache/doris/discussions"><img src="https://img.shields.io/badge/- Discussion -red?style=social&logo=discourse" height=25></a> - <a href="https://apachedoriscommunity.slack.com/join/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA"><img src="https://img.shields.io/badge/-Slack-red?style=social&logo=slack" height=25></a> + <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2unfw3a3q-MtjGX4pAd8bCGC1UV0sKcw" height=25></a> <a href="https://medium.com/@ApacheDoris"><img src="https://img.shields.io/badge/-Medium-red?style=social&logo=medium" height=25></a> @@ -73,65 +73,104 @@ As shown in the figure below, after various data integration and processing, the <br /> + Apache Doris is widely used in the following scenarios: -- Reporting Analysis +- **Real-time Data Analysis**: - - Real-time dashboards - - Reports for in-house analysts and managers - - Highly concurrent user-oriented or customer-oriented report analysis: such as website analysis and ad reporting that usually require thousands of QPS and quick response times measured in milliseconds. A successful user case is that Doris has been used by the Chinese e-commerce giant JD.com in ad reporting, where it receives 10 billion rows of data per day, handles over 10,000 QPS, and delivers a 99 percentile query latency of 150 ms. + - **Real-time Reporting and Decision-making**: Doris provides real-time updated reports and dashboards for both internal and external enterprise use, supporting real-time decision-making in automated processes. + + - **AD Hoc Analysis**: Doris offers multidimensional data analysis capabilities, enabling rapid business intelligence analysis and ad hoc queries to help users quickly uncover insights from complex data. + + - **User Profiling and Behavior Analysis**: Doris can analyze user behavior such as participation, retention, and conversion, while also supporting scenarios like population insights and crowd selection for behavior analysis. -- Ad-Hoc Query. Analyst-oriented self-service analytics with irregular query patterns and high throughput requirements. XiaoMi has built a growth analytics platform (Growth Analytics, GA) based on Doris, using user behavior data for business growth analysis, with an average query latency of 10 seconds and a 95th percentile query latency of 30 seconds or less, and tens of thousands of SQL queries per day. +- **Lakehouse Analytics**: -- Unified Data Warehouse Construction. Apache Doris allows users to build a unified data warehouse via one single platform and save the trouble of handling complicated software stacks. Chinese hot pot chain Haidilao has built a unified data warehouse with Doris to replace its old complex architecture consisting of Apache Spark, Apache Hive, Apache Kudu, Apache HBase, and Apache Phoenix. + - **Lakehouse Query Acceleration**: Doris accelerates lakehouse data queries with its efficient query engine. + + - **Federated Analytics**: Doris supports federated queries across multiple data sources, simplifying architecture and eliminating data silos. + + - **Real-time Data Processing**: Doris combines real-time data streams and batch data processing capabilities to meet the needs of high concurrency and low-latency complex business requirements. -- Data Lake Query. Apache Doris avoids data copying by federating the data in Apache Hive, Apache Iceberg, and Apache Hudi using external tables, and thus achieves outstanding query performance. +- **SQL-based Observability**: -## 🖥️ Core Concepts + - **Log and Event Analysis**: Doris enables real-time or batch analysis of logs and events in distributed systems, helping to identify issues and optimize performance. -### 📂 Architecture of Apache Doris -The overall architecture of Apache Doris is shown in the following figure. The Doris architecture is very simple, with only two types of processes. +## Overall Architecture -- Frontend (FE): user request access, query parsing and planning, metadata management, node management, etc. +Apache Doris uses the MySQL protocol, is highly compatible with MySQL syntax, and supports standard SQL. Users can access Apache Doris through various client tools, and it seamlessly integrates with BI tools. -- Backend (BE): data storage and query plan execution +### Storage-Compute Integrated Architecture -Both types of processes are horizontally scalable, and a single cluster can support up to hundreds of machines and tens of petabytes of storage capacity. And these two types of processes guarantee high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system. +The storage-compute integrated architecture of Apache Doris is streamlined and easy to maintain. As shown in the figure below, it consists of only two types of processes: -<br /> +- **Frontend (FE):** Primarily responsible for handling user requests, query parsing and planning, metadata management, and node management tasks. + +- **Backend (BE):** Primarily responsible for data storage and query execution. Data is partitioned into shards and stored with multiple replicas across BE nodes.  <br /> -In terms of interfaces, Apache Doris adopts MySQL protocol, supports standard SQL, and is highly compatible with MySQL dialect. Users can access Doris through various client tools and it supports seamless connection with BI tools. +In a production environment, multiple FE nodes can be deployed for disaster recovery. Each FE node maintains a full copy of the metadata. The FE nodes are divided into three roles: + +| Role | Function | +| --------- | ------------------------------------------------------------ | +| Master | The FE Master node is responsible for metadata read and write operations. When metadata changes occur in the Master, they are synchronized to Follower or Observer nodes via the BDB JE protocol. | +| Follower | The Follower node is responsible for reading metadata. If the Master node fails, a Follower node can be selected as the new Master. | +| Observer | The Observer node is responsible for reading metadata and is mainly used to increase query concurrency. It does not participate in cluster leadership elections. | + +Both FE and BE processes are horizontally scalable, enabling a single cluster to support hundreds of machines and tens of petabytes of storage capacity. The FE and BE processes use a consistency protocol to ensure high availability of services and high reliability of data. The storage-compute integrated architecture is highly integrated, significantly reducing the operational complexity of distributed systems. + + +## Core Features of Apache Doris + +- **High Availability**: In Apache Doris, both metadata and data are stored with multiple replicas, synchronizing data logs via the quorum protocol. Data write is considered successful once a majority of replicas have completed the write, ensuring that the cluster remains available even if a few nodes fail. Apache Doris supports both same-city and cross-region disaster recovery, enabling dual-cluster master-slave modes. When some nodes experience failures, the cluster can automatically i [...] + +- **High Compatibility**: Apache Doris is highly compatible with the MySQL protocol and supports standard SQL syntax, covering most MySQL and Hive functions. This high compatibility allows users to seamlessly migrate and integrate existing applications and tools. Apache Doris supports the MySQL ecosystem, enabling users to connect Doris using MySQL Client tools for more convenient operations and maintenance. It also supports MySQL protocol compatibility for BI reporting tools and data tr [...] + +- **Real-Time Data Warehouse**: Based on Apache Doris, a real-time data warehouse service can be built. Apache Doris offers second-level data ingestion capabilities, capturing incremental changes from upstream online transactional databases into Doris within seconds. Leveraging vectorized engines, MPP architecture, and Pipeline execution engines, Doris provides sub-second data query capabilities, thereby constructing a high-performance, low-latency real-time data warehouse platform. -### 💾 Storage Engine +- **Unified Lakehouse**: Apache Doris can build a unified lakehouse architecture based on external data sources such as data lakes or relational databases. The Doris unified lakehouse solution enables seamless integration and free data flow between data lakes and data warehouses, helping users directly utilize data warehouse capabilities to solve data analysis problems in data lakes while fully leveraging data lake data management capabilities to enhance data value. -Doris uses a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high compression ratio and largely reduces irrelavant data scans, thus making more efficient use of IO and CPU resources. Doris supports various index structures to minimize data scans: +- **Flexible Modeling**: Apache Doris offers various modeling approaches, such as wide table models, pre-aggregation models, star/snowflake schemas, etc. During data import, data can be flattened into wide tables and written into Doris through compute engines like Flink or Spark, or data can be directly imported into Doris, performing data modeling operations through views, materialized views, or real-time multi-table joins. -- Sorted Compound Key Index: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios. -- MIN/MAX Indexing: This enables effective filtering of equivalence and range queries for numeric types. -- Bloom Filter: very effective in equivalence filtering and pruning of high cardinality columns -- Invert Index: This enables fast search for any field. +## Technical overview +Doris provides an efficient SQL interface and is fully compatible with the MySQL protocol. Its query engine is based on an MPP (Massively Parallel Processing) architecture, capable of efficiently executing complex analytical queries and achieving low-latency real-time queries. Through columnar storage technology for data encoding and compression, it significantly optimizes query performance and storage compression ratio. -### 💿 Storage Models +### Interface -Doris supports a variety of storage models and has optimized them for different scenarios: +Apache Doris adopts the MySQL protocol, supports standard SQL, and is highly compatible with MySQL syntax. Users can access Apache Doris through various client tools and seamlessly integrate it with BI tools, including but not limited to Smartbi, DataEase, FineBI, Tableau, Power BI, and Apache Superset. Apache Doris can work as the data source for any BI tools that support the MySQL protocol. -- Aggregate Key Model: able to merge the value columns with the same keys and significantly improve performance +### Storage engine -- Unique Key Model: Keys are unique in this model and data with the same key will be overwritten to achieve row-level data updates. +Apache Doris has a columnar storage engine, which encodes, compresses, and reads data by column. This enables a very high data compression ratio and largely reduces unnecessary data scanning, thus making more efficient use of IO and CPU resources. -- Duplicate Key Model: This is a detailed data model capable of detailed storage of fact tables. +Apache Doris supports various index structures to minimize data scans: -Doris also supports strongly consistent materialized views. Materialized views are automatically selected and updated, which greatly reduces maintenance costs for users. +- **Sorted Compound Key Index**: Users can specify three columns at most to form a compound sort key. This can effectively prune data to better support highly concurrent reporting scenarios. + +- **Min/Max Index**: This enables effective data filtering in equivalence and range queries of numeric types. + +- **BloomFilter Index**: This is very effective in equivalence filtering and pruning of high-cardinality columns. + +- **Inverted Index**: This enables fast searching for any field. + +Apache Doris supports a variety of data models and has optimized them for different scenarios: + +- **Detail Model (Duplicate Key Model):** A detail data model designed to meet the detailed storage requirements of fact tables. + +- **Primary Key Model (Unique Key Model):** Ensures unique keys; data with the same key is overwritten, enabling row-level data updates. + +- **Aggregate Model (Aggregate Key Model):** Merges value columns with the same key, significantly improving performance through pre-aggregation. + +Apache Doris also supports strongly consistent single-table materialized views and asynchronously refreshed multi-table materialized views. Single-table materialized views are automatically refreshed and maintained by the system, requiring no manual intervention from users. Multi-table materialized views can be refreshed periodically using in-cluster scheduling or external scheduling tools, reducing the complexity of data modeling. ### 🔍 Query Engine -Doris adopts the MPP model in its query engine to realize parallel execution between and within nodes. It also supports distributed shuffle join for multiple large tables so as to handle complex queries. +Apache Doris has an MPP-based query engine for parallel execution between and within nodes. It supports distributed shuffle join for large tables to better handle complicated queries. <br /> @@ -139,7 +178,7 @@ Doris adopts the MPP model in its query engine to realize parallel execution bet <br /> -The Doris query engine is vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, improve cache hit rates, and make efficient use of SIMD instructions. Doris delivers a 5–10 times higher performance in wide table aggregation scenarios than non-vectorized engines. +The query engine of Apache Doris is fully vectorized, with all memory structures laid out in a columnar format. This can largely reduce virtual function calls, increase cache hit rates, and make efficient use of SIMD instructions. Apache Doris delivers a 5~10 times higher performance in wide table aggregation scenarios than non-vectorized engines. <br /> @@ -147,14 +186,12 @@ The Doris query engine is vectorized, with all memory structures laid out in a c <br /> -Apache Doris uses Adaptive Query Execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate runtime filter, push it to the probe side, and automatically penetrate it to the Scan node at the bottom, which drastically reduces the amount of data in the probe and increases join performance. The runtime filter in Doris supports In/Min/Max/Bloom filter. - -### 🚅 Query Optimizer +Apache Doris uses adaptive query execution technology to dynamically adjust the execution plan based on runtime statistics. For example, it can generate a runtime filter and push it to the probe side. Specifically, it pushes the filters to the lowest-level scan node on the probe side, which largely reduces the data amount to be processed and increases join performance. The runtime filter of Apache Doris supports In/Min/Max/Bloom Filter. -In terms of optimizers, Doris uses a combination of CBO and RBO. RBO supports constant folding, subquery rewriting, predicate pushdown and CBO supports Join Reorder. The Doris CBO is under continuous optimization for more accurate statistical information collection and derivation, and more accurate cost model prediction. +Apache Doris uses a Pipeline execution engine that breaks down queries into multiple sub-tasks for parallel execution, fully leveraging multi-core CPU capabilities. It simultaneously addresses the thread explosion problem by limiting the number of query threads. The Pipeline execution engine reduces data copying and sharing, optimizes sorting and aggregation operations, thereby significantly improving query efficiency and throughput. +In terms of the optimizer, Apache Doris employs a combined optimization strategy of CBO (Cost-Based Optimizer), RBO (Rule-Based Optimizer), and HBO (History-Based Optimizer). RBO supports constant folding, subquery rewriting, predicate pushdown, and more. CBO supports join reordering and other optimizations. HBO recommends the optimal execution plan based on historical query information. These multiple optimization measures ensure that Doris can enumerate high-performance query plans acr [...] -**Technical Overview**: 🔗[Introduction to Apache Doris](https://doris.apache.org/docs/dev/summary/basic-summary) ## 🎆 Why choose Apache Doris? @@ -190,7 +227,7 @@ Add your company logo at Apache Doris Website: 🔗[Add Your Company](https://gi ### 📚 Docs -All Documentation 🔗[Docs](https://doris.apache.org/docs/get-starting/quick-start) +All Documentation 🔗[Docs](https://doris.apache.org/docs/gettingStarted/what-is-apache-doris) ### ⬇️ Download @@ -198,11 +235,11 @@ All release and binary version 🔗[Download](https://doris.apache.org/download) ### 🗄️ Compile -See how to compile 🔗[Compilation](https://doris.apache.org/docs/dev/install/source-install/compilation-general) +See how to compile 🔗[Compilation](https://doris.apache.org/community/source-install/compilation-with-docker)) ### 📮 Install -See how to install and deploy 🔗[Installation and deployment](https://doris.apache.org/docs/dev/install/cluster-deployment/standard-deployment) +See how to install and deploy 🔗[Installation and deployment](https://doris.apache.org/docs/install/preparation/env-checking) ## 🧩 Components @@ -248,7 +285,7 @@ Contact us through the following mailing list. * Apache Doris Official Website - [Site](https://doris.apache.org) * Developer Mailing list - <d...@doris.apache.org>. Mail to <dev-subscr...@doris.apache.org>, follow the reply to subscribe the mail list. -* Slack channel - [Join the Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-28il1o2wk-DD6LsLOz3v4aD92Mu0S0aQ) +* Slack channel - [Join the Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2unfw3a3q-MtjGX4pAd8bCGC1UV0sKcw) * Twitter - [Follow @doris_apache](https://twitter.com/doris_apache) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org