This is an automated email from the ASF dual-hosted git repository. englefly pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 5757b60b0f7 [Improvement](docs) refine query acceleration doc (#1610) 5757b60b0f7 is described below commit 5757b60b0f7a176a26f2874a3c2ba0a663687f51 Author: xzj7019 <xiongzhongj...@selectdb.com> AuthorDate: Wed Dec 25 21:34:14 2024 +0800 [Improvement](docs) refine query acceleration doc (#1610) ## Versions - [x] dev - [ ] 3.0 - [ ] 2.1 - [ ] 2.0 ## Languages - [ ] Chinese - [x] English ## Docs Checklist - [ ] Checked by AI - [ ] Test Cases Built --- .../accelerating-queries-with-sql-cache.md | 72 +----- .../tuning/tuning-plan/adjusting-join-shuffle.md | 3 +- .../tuning-plan/controlling-hints-with-cbo-rule.md | 2 +- .../tuning/tuning-plan/dml-tuning-plan.md | 96 +------- .../transparent-rewriting-with-async-mv.md | 268 ++++++++------------- .../transparent-rewriting-with-sync-mv.md | 59 +++-- 6 files changed, 146 insertions(+), 354 deletions(-) diff --git a/docs/query-acceleration/tuning/tuning-plan/accelerating-queries-with-sql-cache.md b/docs/query-acceleration/tuning/tuning-plan/accelerating-queries-with-sql-cache.md index 86c34727adb..cb2fb2254bd 100644 --- a/docs/query-acceleration/tuning/tuning-plan/accelerating-queries-with-sql-cache.md +++ b/docs/query-acceleration/tuning/tuning-plan/accelerating-queries-with-sql-cache.md @@ -24,69 +24,17 @@ specific language governing permissions and limitations under the License. --> -## Principle +## Overview +For the detailed implementation principle of SQL Cache, please refer to the chapter [SQL Cache](../../../query-acceleration/sql-cache-manual). -For detailed implementation principles of SQL Cache, please refer to [SQL Cache](../../../query-acceleration/sql-cache-manual). - -## Use Case - -The following case illustrates how to enable and use SQL Cache in Doris: - -1. Ensure that `cache_enable_sql_mode` in `fe.conf` is set to `true` (default is `true`): - - ```sql - vim fe/conf/fe.conf - cache_enable_sql_mode=true - ``` - - -2. 
Set the variable in the MySQL command line: - - ```sql - MySQL [(none)]> set global enable_sql_cache=true; - ``` - - Note: `GLOBAL` indicates a global variable, not just for the current session. - -3. In Doris versions 2.1.3 and above, you can control the number of cache key entries and the cleanup time with the following commands: - - ```sql - MySQL [(none)]> ADMIN SET FRONTEND CONFIG ('sql_cache_manage_num' = '100'); - MySQL [(none)]> ADMIN SET FRONTEND CONFIG ('expire_sql_cache_in_fe_second' = '300'); - ``` - - -4. Execute the query - - Suppose we have a table named "sales" containing information on date, product, and sales amount. We need to query the total sales for each product over the past 30 days: - - ```sql - SELECT product, SUM(amount) as total_sales - FROM sales - WHERE date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY) - GROUP BY product - ORDER BY total_sales DESC - LIMIT 10; - ``` - - When this query is executed for the first time, Doris retrieves the results from the BE and stores them in the cache. Subsequent executions of the same query will retrieve the results directly from the cache if the data has not been updated, significantly improving query speed. - -5. Cache Conditions - - After the initial query, the query results will be cached if the following three conditions are met: - - - (Current time - last update time of the query partition) is greater than `cache_last_version_interval_second` in `fe.conf`. - - - The number of query result rows is less than `cache_result_max_row_count` in `fe.conf`. - - - The number of bytes in the query results is less than `cache_result_max_data_size` in `fe.conf`. +## Case +For detailed cases, please refer to the chapter [SQL Cache](../../../query-acceleration/sql-cache-manual). ## Summary +SQL Cache is a query optimization mechanism provided by Doris, which can significantly improve query performance. 
When using it, the following points should be noted:

-SQL Cache is a query optimization mechanism provided by Doris that can significantly improve query performance. When using it, please note:
-
-1. SQL Cache is not suitable for queries containing functions that generate random values (such as `random()`), as this can cause the query results to lose randomness.
-
-2. Currently, it does not support using cached results of some metrics to satisfy queries for more metrics. For example, cached results for two metrics cannot be used for queries involving three metrics.
-
-3. By reasonably using SQL Cache, you can significantly improve Doris's query performance, especially in scenarios with low data update frequency. In practical applications, cache parameters need to be adjusted based on specific data characteristics and query patterns to achieve optimal performance improvements. \ No newline at end of file
+:::tip Note
+- SQL Cache is not suitable for queries containing functions that generate random values (such as `random()`), as caching would make the results lose their randomness.
+- Currently, a cached result cannot be reused to answer a query over a larger set of metrics. For example, the cached result of a query on two metrics cannot serve a later query on three metrics.
+- Used appropriately, SQL Cache can significantly improve Doris query performance, especially in scenarios where data is updated infrequently. In practice, tune the cache parameters to the specific data characteristics and query patterns to achieve the best improvement.
+ ::: \ No newline at end of file diff --git a/docs/query-acceleration/tuning/tuning-plan/adjusting-join-shuffle.md b/docs/query-acceleration/tuning/tuning-plan/adjusting-join-shuffle.md index 93979be0c4c..c8d1bb7694f 100644 --- a/docs/query-acceleration/tuning/tuning-plan/adjusting-join-shuffle.md +++ b/docs/query-acceleration/tuning/tuning-plan/adjusting-join-shuffle.md @@ -1,3 +1,4 @@ +--- { "title": "Adjusting Join Shuffle Mode with Hint", "language": "en" @@ -40,7 +41,7 @@ SELECT COUNT(*) FROM t2 JOIN [broadcast] t1 ON t1.c1 = t2.c2; SELECT COUNT(*) FROM t2 JOIN [shuffle] t1 ON t1.c1 = t2.c2; ``` -## Case Study +## Case Next, we will demonstrate the usage of Distribute Hints through an example: diff --git a/docs/query-acceleration/tuning/tuning-plan/controlling-hints-with-cbo-rule.md b/docs/query-acceleration/tuning/tuning-plan/controlling-hints-with-cbo-rule.md index f71bb303bee..3b1db76c4b1 100644 --- a/docs/query-acceleration/tuning/tuning-plan/controlling-hints-with-cbo-rule.md +++ b/docs/query-acceleration/tuning/tuning-plan/controlling-hints-with-cbo-rule.md @@ -51,7 +51,7 @@ Currently, the Doris optimizer supports several cost-based rewrites, which can b - PUSH_DOWN_AGG_THROUGH_JOIN_ONE_SIDE - PUSH_DOWN_DISTINCT_THROUGH_JOIN -## Example +## Case Here is a query example: diff --git a/docs/query-acceleration/tuning/tuning-plan/dml-tuning-plan.md b/docs/query-acceleration/tuning/tuning-plan/dml-tuning-plan.md index 418354590ed..f0497ece45b 100644 --- a/docs/query-acceleration/tuning/tuning-plan/dml-tuning-plan.md +++ b/docs/query-acceleration/tuning/tuning-plan/dml-tuning-plan.md @@ -24,98 +24,6 @@ specific language governing permissions and limitations under the License. --> -DML Plan Tuning Includes Both Import Section (INSERT INTO SELECT) and Query Section (CREATE TABLE AS SELECT - CTAS). Next, will introduce the principles and tuning practices of these two parts separately. 
+For DML plan tuning, it is first necessary to identify whether the performance bottleneck is caused by the import process or the query section. For the troubleshooting and tuning of performance bottlenecks in the query section, please refer to other subsections in [Plan Tuning](optimizing-table-schema.md) for details. -## Loading - -### Principles - -Apache Doris offers a variety of flexible data import solutions to meet data access needs in different scenarios. Doris supports importing data from the following data sources: - -1. Object storage (S3) and HDFS - -2. Local files - -3. Kafka - -4. Relational databases (such as MySQL, PostgreSQL, Oracle, SQLServer, etc.) - -5. Data sources connected via JDBC - -6. JSON format data - -**Doris provides the following main data import methods:** - -1. Broker Load: Imports data from external storage systems through the Broker process - -2. Stream Load: Streams data from local files or in-memory data - -3. Routine Load: Continuously imports data from Kafka - -4. INSERT INTO: Imports data through SQL insert statements - -5. S3 Load: Directly imports data from object storage that supports the S3 protocol - -6. MySQL Load: Uses MySQL clients to import local data - -**Different import methods support slightly different data formats:** - -1. Broker Load: Supports Parquet, ORC, CSV, and GZip formats - -2. Stream Load: Supports CSV, JSON, Parquet, and ORC formats - -3. Routine Load: Supports CSV and JSON formats - -4. MySQL Load: Supports CSV format - -**Data import has the following mechanisms:** - -1. Atomicity guarantee: Each import job acts as a complete transaction, ensuring atomic data writes. - -2. Import label: Each import job is assigned a unique label to ensure At-Most-Once semantics. - -3. Synchronous/Asynchronous mode: Synchronous mode returns results immediately, while asynchronous mode requires separate querying of job status. - -4. Array type support: Array type data can be imported using CAST and array functions. 
- -5. Execution engine: Users can choose whether to use the Pipeline engine to execute import tasks based on configuration. - -**In practicality, the following considerations are necessary:** - -1. Reasonable selection of import methods: Choose the most appropriate import method for different data sources. - -2. Utilization of label mechanism: Achieve Exactly-Once semantics guarantee. - -3. Proper configuration of parallelism: Adjust the number of parallel imports based on cluster resources. - -4. Monitoring of import status: For asynchronous imports, check the job's progress timely. - -By flexibly using the various import functions provided by Doris, data from various sources can be efficiently imported into Doris for analysis. For more details, please refer to the [Data Loading Overview](../../../data-operate/import/import-way/load-manual). - -### Loading Optimization - -The Pipeline engine is a new query execution engine in Doris, designed to improve the efficiency of queries and data processing. During data import, the Pipeline engine can also be enabled to enhance overall performance. By default, the Pipeline engine is disabled during data import, but users can enable it through relevant configurations. - -To enable the Pipeline engine during data import, configure the following variables: - -**1. FE Configuration Item: enable_pipeline_load** - -- Location: In the configuration file of the FE (Frontend) - -- Function: When enabled, import tasks such as Stream Load will attempt to execute using the Pipeline engine - -**2. Session Variable: enable_nereids_dml_with_pipeline** - -- Location: Set at session level - -- Function: When enabled, INSERT INTO statements will attempt to execute using the Pipeline engine - -**3. 
Session Variable: enable_pipeline_engine** - -- Location: Set at session level - -- Function: Controls whether the Pipeline engine is actually enabled - -## Querying - -For details, please refer to other sections on [plan tuning](../../../query-acceleration/tuning/tuning-plan/optimizing-table-schema). \ No newline at end of file +Doris supports importing data from multiple data sources. By flexibly utilizing the various import functions provided by Doris, data from various sources can be efficiently imported into Doris for analysis. For details of best practices, please refer to [Import Overview](../../../data-operate/import/load-manual.md). \ No newline at end of file diff --git a/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-async-mv.md b/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-async-mv.md index 6665ddc1949..5b2db7e426d 100644 --- a/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-async-mv.md +++ b/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-async-mv.md @@ -24,202 +24,138 @@ specific language governing permissions and limitations under the License. --> -## Principle +## Overview -The aync-materialized view adopts a transparent rewriting algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) model. This algorithm can analyze the structural information of query SQL, automatically find suitable materialized views, and attempt transparent rewriting to utilize the optimal materialized view to express the query SQL. By using precomputed materialized view results, it can significantly improve query performance and reduce computational costs. +The [multi-table materialized view](../../materialized-view/async-materialized-view/overview.md) adopts a transparent rewriting algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern. 
This algorithm can analyze the structural information of the query SQL, automatically find the appropriate materialized views, and attempt to perform transparent rewriting to express the query SQL using the optimal materialized views. By using the pre-computed results of materialized views, query performance can be significantly improved and computational costs reduced.

-## Tuning Case
+## Case

-Next, through an example, we will demonstrate in detail how to use async-materialized views to optimize queries. This example covers a series of operations including the creation, metadata viewing, data refreshing, task management, modification, and deletion of materialized views.
+Next, an example will be used to demonstrate in detail how to utilize multi-table materialized views to accelerate queries.

-### 1 Creation of Base Tables and Data Insertion
+### Create Base Tables

-First, create two tables, `orders` and `lineitem`, in the tpch database, and insert the corresponding data.
+First, create the tpch database, then create two tables, `orders` and `lineitem`, in it and insert the corresponding data.
```sql CREATE DATABASE IF NOT EXISTS tpch; - -USE tpch; - --- Create the orders table -CREATE TABLE IF NOT EXISTS orders ( - o_orderkey integer not null, - o_custkey integer not null, - o_orderstatus char(1) not null, - o_totalprice decimalv3(15,2) not null, - o_orderdate date not null, - o_orderpriority char(15) not null, - o_clerk char(15) not null, - o_shippriority integer not null, - o_comment varchar(79) not null -) -DUPLICATE KEY(o_orderkey, o_custkey) -PARTITION BY RANGE(o_orderdate)( - FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY -) -DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 -PROPERTIES ("replication_num" = "1"); - --- Insert data into the orders table -INSERT INTO orders VALUES - (1, 1, 'o', 99.5, '2023-10-17', 'a', 'b', 1, 'yy'), - (2, 2, 'o', 109.2, '2023-10-18', 'c','d',2, 'mm'), - (3, 3, 'o', 99.5, '2023-10-19', 'a', 'b', 1, 'yy'); - --- Create the lineitem table -CREATE TABLE IF NOT EXISTS lineitem ( - l_orderkey integer not null, - l_partkey integer not null, - l_suppkey integer not null, - l_linenumber integer not null, - l_quantity decimalv3(15,2) not null, - l_extendedprice decimalv3(15,2) not null, - l_discount decimalv3(15,2) not null, - l_tax decimalv3(15,2) not null, - l_returnflag char(1) not null, - l_linestatus char(1) not null, - l_shipdate date not null, - l_commitdate date not null, - l_receiptdate date not null, - l_shipinstruct char(25) not null, - l_shipmode char(10) not null, - l_comment varchar(44) not null -) -DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) -PARTITION BY RANGE(l_shipdate) -(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) -DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 -PROPERTIES ("replication_num" = "1"); - --- Insert data into the lineitem table -INSERT INTO lineitem VALUES - (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-17', '2023-10-17', '2023-10-17', 'a', 'b', 'yyyyyyyyy'), - (2, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-18', '2023-10-18', '2023-10-18', 'a', 'b', 
'yyyyyyyyy'), +USE tpch; + +CREATE TABLE IF NOT EXISTS orders ( + o_orderkey integer not null, + o_custkey integer not null, + o_orderstatus char(1) not null, + o_totalprice decimalv3(15,2) not null, + o_orderdate date not null, + o_orderpriority char(15) not null, + o_clerk char(15) not null, + o_shippriority integer not null, + o_comment varchar(79) not null +) +DUPLICATE KEY(o_orderkey, o_custkey) +PARTITION BY RANGE(o_orderdate)( + FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY +) +DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 +PROPERTIES ("replication_num" = "1"); + +INSERT INTO orders VALUES + (1, 1, 'o', 99.5, '2023-10-17', 'a', 'b', 1, 'yy'), + (2, 2, 'o', 109.2, '2023-10-18', 'c','d',2, 'mm'), + (3, 3, 'o', 99.5, '2023-10-19', 'a', 'b', 1, 'yy'); + +CREATE TABLE IF NOT EXISTS lineitem ( + l_orderkey integer not null, + l_partkey integer not null, + l_suppkey integer not null, + l_linenumber integer not null, + l_quantity decimalv3(15,2) not null, + l_extendedprice decimalv3(15,2) not null, + l_discount decimalv3(15,2) not null, + l_tax decimalv3(15,2) not null, + l_returnflag char(1) not null, + l_linestatus char(1) not null, + l_shipdate date not null, + l_commitdate date not null, + l_receiptdate date not null, + l_shipinstruct char(25) not null, + l_shipmode char(10) not null, + l_comment varchar(44) not null +) +DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) +PARTITION BY RANGE(l_shipdate) +(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 +PROPERTIES ("replication_num" = "1"); + +INSERT INTO lineitem VALUES + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-17', '2023-10-17', '2023-10-17', 'a', 'b', 'yyyyyyyyy'), + (2, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-18', '2023-10-18', '2023-10-18', 'a', 'b', 'yyyyyyyyy'), (3, 2, 3, 6, 7.5, 8.5, 9.5, 10.5, 'k', 'o', '2023-10-19', '2023-10-19', '2023-10-19', 'c', 'd', 'xxxxxxxxx'); ``` -### 2 Creation of Async-Materialized View +### 
Create an Asynchronous Materialized View

-Next, create an asynchronous materialized view `mv1`.
+Based on the `orders` and `lineitem` tables created above, create an asynchronous materialized view named `mv1`.

```sql
CREATE MATERIALIZED VIEW mv1
-BUILD DEFERRED REFRESH AUTO ON MANUAL
+BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL
PARTITION BY(l_shipdate)
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
-AS
-SELECT l_shipdate, o_orderdate, l_partkey, l_suppkey, SUM(o_totalprice) AS sum_total
-FROM lineitem
-LEFT JOIN orders ON lineitem.l_orderkey = orders.o_orderkey AND l_shipdate = o_orderdate
-GROUP BY
-l_shipdate,
-o_orderdate,
-l_partkey,
+ @@ -119,107 +114,55 @@ l_partkey, l_suppkey; ```

-### 3 Viewing Materialized View Metadata
-
-```sql
-SELECT * FROM mv_infos("database"="tpch") WHERE Name="mv1";
-```
-
-### 4 Refreshing the Materialized View
-
-First, view the partition list:
-
-```sql
-SHOW PARTITIONS FROM mv1;
-```
-
-Then refresh a specific partition:
-
-```sql
-REFRESH MATERIALIZED VIEW mv1 PARTITIONS(p_20231017_20231018);
-```
-
-### 5 Task Management
-
-Manage jobs for materialized views, including viewing jobs, pausing scheduled tasks, resuming scheduled tasks, and viewing and canceling tasks.
- -- View materialized view jobs - - ```sql - SELECT * FROM jobs("type"="mv") ORDER BY CreateTime; - ``` - -- Pause scheduled tasks for materialized views - - ```sql - PAUSE MATERIALIZED VIEW JOB ON mv1; - ``` -- Resume scheduled tasks for materialized views - - ```sql - RESUME MATERIALIZED VIEW JOB ON mv1; - ``` - -- View materialized view tasks - - ```sql - SELECT * FROM tasks("type"="mv"); - ``` - -- Cancel a materialized view task: assuming `realTaskId` is 123 - - ```sql - CANCEL MATERIALIZED VIEW TASK 123 ON mv1; - ``` - -### 6 Modifying the Materialized View +### Use the Materialized View for Transparent Rewriting ```sql -ALTER MATERIALIZED VIEW mv1 SET("grace_period"="3333"); +mysql> explain shape plan SELECT l_shipdate, SUM(o_totalprice) AS total_price + -> FROM lineitem + -> LEFT JOIN orders ON lineitem.l_orderkey = orders.o_orderkey AND l_shipdate = o_orderdate + -> WHERE l_partkey = 2 AND l_suppkey = 3 + -> GROUP BY l_shipdate; ++-------------------------------------------------------------------+ +| Explain String(Nereids Planner) | ++-------------------------------------------------------------------+ +| PhysicalResultSink | +| --PhysicalDistribute[DistributionSpecGather] | +| ----PhysicalProject | +| ------hashAgg[GLOBAL] | +| --------PhysicalDistribute[DistributionSpecHash] | +| ----------hashAgg[LOCAL] | +| ------------PhysicalProject | +| --------------filter((mv1.l_partkey = 2) and (mv1.l_suppkey = 3)) | +| ----------------PhysicalOlapScan[mv1] | ++-------------------------------------------------------------------+ ``` -### 7 Deleting the Materialized View +It can be seen from the `explain shape plan` that the plan after being transparently rewritten by `mv1` has already hit `mv1`. 
You can also use `explain` to view the current state of the plan after being rewritten by the materialized view, including whether it has hit and which materialized view has been hit, etc., as shown below: ```sql -DROP MATERIALIZED VIEW mv1; +| ========== MATERIALIZATIONS ========== | +| | +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| internal.tpch.mv1 chose, | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| not chose: none, | +| | +| MaterializedViewRewriteFail: | ``` -### 8 Querying Using the Materialized View - -- Direct query - - ```sql - SELECT l_shipdate, sum_total - FROM mv1 - WHERE l_partkey = 2 AND l_suppkey = 3; - ``` - -- Query through transparent rewriting (the query optimizer automatically uses the materialized view) - - ```sql - SELECT l_shipdate, SUM(o_totalprice) AS total_price - FROM lineitem - LEFT JOIN orders ON lineitem.l_orderkey = orders.o_orderkey AND l_shipdate = o_orderdate - WHERE l_partkey = 2 AND l_suppkey = 3 - GROUP BY l_shipdate; - ``` - -The above example fully demonstrates the lifecycle of an asynchronous materialized view, including creation, management, usage, and deletion. - ## Summary -By utilizing materialized views, query performance can be significantly enhanced, particularly for complex aggregated queries. Several considerations should be kept in mind when using them: - -1. Precomputed Results: Materialized views precompute and store query results, avoiding the overhead of repeated computations for each query. This is especially effective for complex queries that need to be executed frequently. - -2. Reduction of Join Operations: Materialized views can consolidate data from multiple tables into a single view, reducing the need for join operations during queries and thereby improving query efficiency. - -3. Automatic Updates: When the data in the base tables changes, materialized views can be automatically updated to maintain data consistency. 
This ensures that query results always reflect the latest data state. - -4. Storage Overhead: Materialized views require additional storage space to save precomputed results. When creating materialized views, a trade-off between query performance improvement and storage space consumption needs to be considered. - -5. Maintenance Cost: The maintenance of materialized views requires certain system resources and time. Frequently updated base tables may result in higher update overhead for materialized views. Therefore, it is necessary to select an appropriate refresh strategy based on actual conditions. +By using multi-table materialized views, the query performance can be significantly improved, especially for complex join and aggregation queries. When using them, the following points need to be noted: -6. Use Cases: Materialized views are suitable for scenarios where data changes infrequently but query frequency is high. For data that changes frequently, real-time computation may be more appropriate. +:::tip Usage Suggestions +- Pre-computed Results: Materialized views pre-compute and store the query results, avoiding the overhead of repeated computations for each query. This is especially effective for complex queries that need to be executed frequently. +- Reducing Join Operations: Materialized views can combine the data of multiple tables into one view, reducing the join operations during queries and thus improving query efficiency. +- Automatic Updates: When the data in the base tables changes, materialized views can be updated automatically to maintain data consistency. This ensures that the query results always reflect the latest data status. +- Space Overhead: Materialized views require additional storage space to save the pre-computed results. When creating materialized views, it is necessary to balance the improvement in query performance and the consumption of storage space. 
+- Maintenance Cost: The maintenance of materialized views requires certain system resources and time. Base tables that are updated frequently may lead to relatively high update overheads for materialized views. Therefore, it is necessary to choose an appropriate refresh strategy according to the actual situation. +- Applicable Scenarios: Materialized views are suitable for scenarios where the data change frequency is low and the query frequency is high. For frequently changing data, real-time computation may be more appropriate. + ::: -The rational use of materialized views can significantly improve database query performance, especially in the case of complex queries and large data volumes. At the same time, it is also necessary to comprehensively consider factors such as storage and maintenance to achieve a balance between performance and cost. \ No newline at end of file +Reasonable utilization of multi-table materialized views can significantly improve the query performance of the database, especially in the case of complex queries and large data volumes. Meanwhile, factors such as storage and maintenance also need to be considered comprehensively to achieve a balance between performance and cost. \ No newline at end of file diff --git a/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-sync-mv.md b/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-sync-mv.md index 564471a2772..4cba281c363 100644 --- a/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-sync-mv.md +++ b/docs/query-acceleration/tuning/tuning-plan/transparent-rewriting-with-sync-mv.md @@ -1,7 +1,7 @@ --- { - "title": "Transparent Rewriting with Sync-Materialized View", - "language": "en" + "title": "Transparent Rewriting with Single-table Materialized View", + "language": "en" } --- @@ -24,73 +24,72 @@ specific language governing permissions and limitations under the License. 
--> -## Principle +## Overview -A sync-materialized view is a special type of table that precomputes and stores data based on a predefined SELECT statement. Its primary purpose is to satisfy users' analytical needs for arbitrary dimensions of raw detailed data while also enabling rapid fixed-dimension analytical queries. +The [Materialized View](../../materialized-view/sync-materialized-view.md) is a special kind of table that pre-computes and stores data according to a predefined SELECT statement. Its main purpose is to meet users' needs for analyzing raw detailed data from any dimension and also enable quick analysis and queries on fixed dimensions. -Materialized views are suitable for the following scenarios: +The applicable scenarios for materialized views are as follows: -1. Analytical requirements cover both detailed data queries and fixed-dimension queries. +1. The analysis requirements cover both detailed data queries and fixed-dimension queries. +2. The queries only involve a small number of columns or rows in the table. +3. The queries include time-consuming processing operations, such as long-duration aggregation operations, etc. +4. The queries need to match different prefix indexes. -2. Queries only involve a small subset of columns or rows in the table. +For queries that frequently and repeatedly use the results of the same subqueries, single-table synchronous materialized views can significantly improve performance. Doris will automatically maintain the data of materialized views to ensure data consistency between the base table and the materialized view table, without requiring additional manual maintenance costs. During a query, the system will automatically match the optimal materialized view and directly read data from it. -3. Queries contain time-consuming processing operations, such as long aggregation operations. +:::tip Precautions +- In Doris 2.0 and subsequent versions, materialized views have some enhanced functions. 
It is recommended that users confirm in the test environment whether the expected queries can hit the materialized views they want to create before using materialized views in the formal production environment. +- It is not recommended to create multiple materialized views with similar forms on the same table, as this may lead to conflicts among multiple materialized views and thus cause query hit failures. + ::: -4. Queries require matching different prefix indexes. +## Case -For queries that frequently reuse the same subquery results, a sync-materialized view can significantly enhance performance. Doris automatically maintains the data in the materialized view, ensuring data consistency between the base table and the materialized view without additional manual maintenance costs. During queries, the system automatically matches the optimal materialized view and reads data directly from it. +The following uses a specific example to demonstrate the process of using single-table materialized views to accelerate queries: -## Tuning Usage Case +Suppose we have a detailed sales record table named `sales_records`, which records various pieces of information for each transaction in detail, including the transaction ID, salesperson ID, selling store ID, sales date, and transaction amount. Now, we often need to conduct analysis and queries on the sales volume of different stores. -The following is a specific example to illustrate the use of single-table materialized views: - -Suppose we have a detailed sales record table `sales_records` that records various information for each transaction, including transaction ID, salesperson ID, store ID, sales date, and transaction amount. Now, we frequently need to perform analytical queries on sales volumes for different stores. - -To optimize the performance of these queries, we can create a materialized view `store_amt` that groups by store ID and sums the sales amounts for the same store. 
The specific steps are as follows: +To optimize the performance of these queries, we can create a materialized view named `store_amt`. This view groups by the selling store and sums up the sales amounts of the same store. The specific steps are as follows: ### Create a Materialized View -First, we use the following SQL statement to create the materialized view `store_amt`: +Firstly, we use the following SQL statement to create the materialized view `store_amt`: ```sql CREATE MATERIALIZED VIEW store_amt AS -SELECT store_id, SUM(sale_amt) -FROM sales_records + @@ -59,38 +61,36 @@ FROM sales_records GROUP BY store_id; ``` -After submitting the creation task, Doris will asynchronously build this materialized view in the background. We can view the creation progress of the materialized view through the following command: +After submitting the creation task, Doris will build this materialized view asynchronously in the background. We can use the following command to check the creation progress of the materialized view: ```sql -SHOW ALTER TABLE MATERIALIZED VIEW FROM db_name; +SHOW ALTER TABLE MATERIALIZED VIEW FROM db_name; ``` -When the `State` field changes to `FINISHED`, it indicates that the `store_amt` materialized view has been successfully created. - -### Query Data +When the `State` field becomes `FINISHED`, it means that the `store_amt` materialized view has been successfully created. -After the materialized view is created, when we query the sales volumes of different stores, Doris will automatically match the `store_amt` materialized view and read the pre-aggregated data directly from it, significantly improving query efficiency. +### Transparent Rewriting -The query statement is as follows: +After the materialized view is created, when we query the sales volume of different stores, Doris will automatically match the `store_amt` materialized view and directly read the pre-aggregated data from it, thus significantly improving the query efficiency. 
The query statement is as follows: ```sql SELECT store_id, SUM(sale_amt) FROM sales_records GROUP BY store_id; ``` -We can also use the `EXPLAIN` command to check whether the query successfully hits the materialized view: +We can also use the `EXPLAIN` command to check whether the query has successfully hit the materialized view: ```sql EXPLAIN SELECT store_id, SUM(sale_amt) FROM sales_records GROUP BY store_id; ``` -At the end of the execution plan, if similar content is displayed, it indicates that the query successfully hits the `store_amt` materialized view: +At the end of the execution plan, if something like the following is displayed, it means that the query has successfully hit the `store_amt` materialized view: ```sql TABLE: default_cluster:test.sales_records(store_amt), PREAGGREGATION: ON ``` -By following these steps, we can utilize single-table materialized views to optimize query performance and improve the efficiency of data analysis. +Through the above steps, we can use single-table materialized views to optimize query performance and improve the efficiency of data analysis. ## Summary -By creating materialized views, we can significantly enhance the query speed for related aggregation analyses. Materialized views not only enable us to perform statistical analyses quickly but also flexibly support the query requirements of detailed data, making them a very powerful feature in Doris. \ No newline at end of file +By creating single-table materialized views, we can significantly improve the query speed for relevant aggregation analysis. Single-table materialized views not only enable us to conduct statistical analysis quickly but also flexibly support the query needs for detailed data, which is a very powerful feature in Doris. \ No newline at end of file --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org