This is an automated email from the ASF dual-hosted git repository. luzhijing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 989e44ea7e3 [doc](mtmv) Add doc for materialized view rewrite new feature and add FAQ (#717) 989e44ea7e3 is described below commit 989e44ea7e38a1987d40d2eed75dc516d4cdf424 Author: seawinde <149132972+seawi...@users.noreply.github.com> AuthorDate: Wed Jun 26 10:34:02 2024 +0800 [doc](mtmv) Add doc for materialized view rewrite new feature and add FAQ (#717) Add doc for materialized view rewrite new feature and add FAQ --------- Co-authored-by: Luzhijing <82810928+luzhij...@users.noreply.github.com> --- .../query-async-materialized-view.md | 433 +++++++++++++++++-- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 17 +- .../query-async-materialized-view.md | 426 +++++++++++++++--- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 84 ++-- .../query-async-materialized-view.md | 416 +++++++++++++++--- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 115 ++--- .../query-async-materialized-view.md | 481 ++++++++++++++++++--- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 18 +- 8 files changed, 1625 insertions(+), 365 deletions(-) diff --git a/docs/query/view-materialized-view/query-async-materialized-view.md b/docs/query/view-materialized-view/query-async-materialized-view.md index 9514a5927c7..09a88a8a6ad 100644 --- a/docs/query/view-materialized-view/query-async-materialized-view.md +++ b/docs/query/view-materialized-view/query-async-materialized-view.md @@ -58,7 +58,7 @@ CREATE TABLE IF NOT EXISTS lineitem ( ) DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) PARTITION BY RANGE(l_shipdate) -(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) +(FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -81,7 +81,7 @@ CREATE TABLE IF NOT EXISTS orders ( ) DUPLICATE KEY(o_orderkey, o_custkey) PARTITION BY RANGE(o_orderdate)( - FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) + FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -356,6 +356,8 @@ Temporary support for the aggregation roll-up functions is as follows: | count(distinct ) | bitmap_union | bitmap_union_count | | bitmap_union | bitmap_union | bitmap_union | | bitmap_union_count | bitmap_union | bitmap_union_count | +| hll_union_agg, approx_count_distinct, hll_cardinality | hll_union 或者 hll_raw_agg | hll_union_agg | + ## Query partial Transparent Rewriting (Coming soon) When the number of tables in the materialized view is greater than the query, if the materialized view @@ -392,9 +394,8 @@ Query statement: LEFT OUTER JOIN orders ON L_ORDERKEY = O_ORDERKEY; ``` -## Union Rewriting (Coming soon) -When the materialized view is not sufficient to provide all the data for the query, it can use Union to return -data by combining the original table and the materialized view. +## Union Rewriting +When a materialized view is insufficient to provide all the data required by a query, a UNION ALL approach can be used to combine data from both the original table and the materialized view for the final result. Currently, the materialized view needs to be a partitioned materialized view, and UNION ALL can be used to supplement the data by applying the filter conditions on the partition fields. For example: **Case 1** @@ -403,44 +404,125 @@ Materialized view definition: ```sql CREATE MATERIALIZED VIEW mv7 -BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 hour -DISTRIBUTED BY RANDOM BUCKETS 3 -PROPERTIES ('replication_num' = '1') -AS -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 10; +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +as +select l_shipdate, o_orderdate, l_partkey, + l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` +When a new partition 2023-10-21 is added to the base table and the materialized view has not yet been refreshed, the result can be returned by combining the materialized view with the original table using UNION ALL. + +```sql +insert into lineitem values + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-21', '2023-10-21', '2023-10-21', 'a', 'b', 'yyyyyyyyy'); +``` + + Query statement: ```sql -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5; +select l_shipdate, o_orderdate, l_partkey, l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` Rewriting result: ```sql SELECT * -FROM mv -UNION ALL -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5 AND o_orderkey <= 10; +FROM mv7 +union all +select t1.l_shipdate, o_orderdate, t1.l_partkey, t1.l_suppkey, sum(o_totalprice) as sum_total +from (select * from lineitem where l_shipdate = '2023-10-21') t1 + left join orders on t1.l_orderkey = orders.o_orderkey and t1.l_shipdate = o_orderdate +group by + t1.l_shipdate, + o_orderdate, + t1.l_partkey, + t1.l_suppkey; ``` +Noted: +The materialized view includes a WHERE condition. For example, if the materialized view is constructed with the filter condition WHERE l_shipdate > '2023-10-19' and the query condition is WHERE l_shipdate > '2023-10-18', this situation currently cannot be compensated for using UNION ALL. This will be supported in the future. + +## Nested Materialized View Rewrite +The definition SQL of a materialized view can use another materialized view; this type of materialized view is called a nested materialized view. Theoretically, there is no limit to the number of nested layers. This materialized view can be queried directly or can participate in transparent rewrite operations. Nested materialized views can also be involved in transparent rewrites. + +**Case 1** + +Here is an example to illustrate how nested materialized views work: + +First, create the inner materialized view `mv8_0_inner_mv`. + +```sql +CREATE MATERIALIZED VIEW mv8_0_inner_mv +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey +from lineitem +inner join orders on lineitem.l_orderkey = orders.o_orderkey; +``` + +Create the outer materialized view `mv8_0`. + +```sql +CREATE MATERIALIZED VIEW mv8_0 +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey, +ps_availqty +from mv8_0_inner_mv +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey; +``` + +The following query will result in successful rewrite for both mv8_0_inner_mv and mv8_0. Ultimately, the cost-based optimizer will select mv8_0. + +```sql +select lineitem.l_linenumber +from lineitem +inner join orders on l_orderkey = o_orderkey +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey +where o_orderstatus = 'o' +``` + +**Note:** + +1. The more layers of nested materialized views, the longer the time it will take for transparent rewriting. It is recommended not to exceed 3 layers of nested materialized views. +2. Transparent rewriting of nested materialized views is disabled by default. See the switch below for enabling it. + + + ## Auxiliary Functions **Data Consistency Issues After Transparent Rewriting** @@ -511,27 +593,290 @@ you can execute the following statement. It will provide a detailed breakdown of | SET enable_materialized_view_rewrite = true; | Enable or disable query transparent rewriting, default is disabled | | SET materialized_view_rewrite_enable_contain_external_table = true; | Whether materialized views participating in transparent rewriting are allowed to contain external tables, default is not allowed | | SET materialized_view_rewrite_success_candidate_num = 3; | Transparently rewrites the successful result set, allowing the maximum number of CBO candidates to participate, the default is 3 | - +| SET enable_materialized_view_union_rewrite = true; | Whether to allow the union of base table and materialized view using UNION ALL when the partitioned materialized view is insufficient to provide all the data required by a query. Default is enabled. | +| SET enable_materialized_view_nest_rewrite = true; | Whether to allow nested rewriting. Default is disabled. | +| SET materialized_view_relation_mapping_max_count = 8; | Maximum number of relation mappings allowed during transparent rewrite. If exceeded, truncation will occur. Relation mapping is typically generated by self-joins in tables and the number is usually the Cartesian product, for example, if there are 3 tables, it may generate 8 combinations. Default is 8. | ## Limitations -- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY clauses. - The input for JOIN can include simple GROUP BY (aggregation on a single table). - Supported types of JOIN operations include INNER and LEFT OUTER JOIN. - Support for other types of JOIN operations will be gradually added. +- Materialized view definition statements are only allowed to include SELECT, FROM, WHERE, JOIN, and GROUP BY clauses. +- The input to JOIN can include simple GROUP BY (single-table aggregation). Supported JOIN types include INNER, LEFT OUTER JOIN, +- RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, RIGHT SEMI JOIN, LEFT ANTI JOIN, and RIGHT ANTI JOIN. -- Materialized views based on External Tables do not guarantee strong consistency in query results. +- Materialized views based on External Tables do not guarantee strong consistency of query results. -- The use of non-deterministic functions to build materialized views is not supported, - including rand, now, current_time, current_date, random, uuid, etc. +- The use of non-deterministic functions to construct materialized views is not supported, including rand, now, current_time, +- current_date, random, uuid, etc. - Transparent rewriting does not support window functions. -- There is LIMIT in queries and materialized views, and transparent rewriting is not supported for the time being. +- Materialized views with LIMIT are currently not supported for transparent rewriting. - Currently, materialized view definitions cannot utilize views or other materialized views. - When the query or materialized view has no data, transparent rewriting is not supported. -- Currently, WHERE clause compensation supports scenarios where the materialized view does not have a WHERE clause, - but the query does, or where the materialized view has a WHERE clause and the query's WHERE clause is a superset - of the materialized view's. Range condition compensation is not yet supported but will be added gradually. +- Currently, WHERE condition compensation only supports compensating conditions on numeric and date type columns. +For example, if the materialized view is defined as a > 5 and the query is a > 10, transparent rewriting is supported. + +## Frequently Asked Questions + +### 1. Why isn't the materialized view being used? +To determine why a materialized view is not being used, execute the following SQL: + +`explain your_query_sql;` + +a. The transparent rewriting feature for materialized views is disabled by default. You need to enable the corresponding switch for it to work. See the related switches for asynchronous materialized views. + +b. The materialized view may not be available, causing transparent rewriting to fail. Check the status of materialized view construction, see problem 2. + +c. After checking the first two steps, if the materialized view is still not being used, it may be because the definition SQL of the materialized view and the query SQL are not within the current capability of the materialized view rewriting. See the capabilities of materialized view transparent rewriting. + +### 2. How to check if the materialized view status is normal? +#### 2.1 Confirm the Materialized View Construction Status +To participate in transparent rewriting, the status of the materialized view must be Success. First, run the following SQL to check the JobName of the materialized view: + +`select * from mv_infos('database'='db_name') where Name = 'mv_name'` + +Next, use the JobName to check the task status of the materialized view. Run the following SQL: + +`select * from tasks("type"="mv") where JobName = 'job_name';` + +Check if the status of the most recent task execution is Success. + +#### 2.2 Confirm the Availability of Consistent Materialized View Data +If the materialized view is successfully built but is unavailable due to data changes and the `grace_period` setting, +confirm the availability of consistent materialized view data. + +**For Full Refresh Materialized Views:** + +Run the following SQL and check if the `SyncWithBaseTables` field is 1: + +`select * from mv_infos('database'='db_name') where Name = 'mv_name'` + +**For Partitioned Materialized Views:** + +Run the following SQL to check if the partitions used in the query are valid: +`show partitions from mv_name;` + +### 3. Error During Materialized View Construction +Error Message: + +`ERROR 1105 (HY000): errCode = 2, detailMessage = Syntax error in line 1: +BUILD IMMEDIATE REFRESH AUTO ON MANUAL` + +a. The statement for asynchronous materialized views is only supported under the new optimizer. +Ensure that the new optimizer is enabled: + +`SET global enable_nereids_planner = true;` + +b. It's possible that there is a typo in the keywords used in the statement to build the materialized view, or there +may be syntax errors in the materialized view definition SQL. Check the materialized view definition SQL and the create +materialized view statement to ensure correctness. + +### 4. Error: Unable to Find a Suitable Base Table for Partitioning +This error typically indicates that the SQL definition of the materialized view and the selection of partitioning fields +for the materialized view are not suitable for partitioned incremental updates. Therefore, creating a partitioned +materialized view will result in this error. + +For a materialized view to support partitioned incremental updates, it needs to meet certain requirements. +For more details, refer to the [CREATE ASYNC MATERIALIZED VIEW documentation](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md). + +Here is an example to illustrate the construction of a partitioned materialized view: + +```sql + CREATE TABLE IF NOT EXISTS lineitem ( + l_orderkey INTEGER NOT NULL, + l_partkey INTEGER NOT NULL, + l_suppkey INTEGER NOT NULL, + l_linenumber INTEGER NOT NULL, + l_quantity DECIMALV3(15,2) NOT NULL, + l_extendedprice DECIMALV3(15,2) NOT NULL, + l_discount DECIMALV3(15,2) NOT NULL, + l_tax DECIMALV3(15,2) NOT NULL, + l_returnflag CHAR(1) NOT NULL, + l_linestatus CHAR(1) NOT NULL, + l_shipdate DATE NOT NULL, + l_commitdate DATE NOT NULL, + l_receiptdate DATE NOT NULL, + l_shipinstruct CHAR(25) NOT NULL, + l_shipmode CHAR(10) NOT NULL, + l_comment VARCHAR(44) NOT NULL + ) + DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) + PARTITION BY RANGE(l_shipdate) ( + PARTITION `day_1` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_2` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_3` VALUES LESS THAN ("2023-12-30")) + DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +```sql + CREATE TABLE IF NOT EXISTS orders ( + o_orderkey INTEGER NOT NULL, + o_custkey INTEGER NOT NULL, + o_orderstatus CHAR(1) NOT NULL, + o_totalprice DECIMALV3(15,2) NOT NULL, + o_orderdate DATE NOT NULL, + o_orderpriority CHAR(15) NOT NULL, + o_clerk CHAR(15) NOT NULL, + o_shippriority INTEGER NOT NULL, + O_COMMENT VARCHAR(79) NOT NULL + ) + DUPLICATE KEY(o_orderkey, o_custkey) + PARTITION BY RANGE(o_orderdate) ( + PARTITION `day_2` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_3` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_4` VALUES LESS THAN ("2023-12-30") + ) + DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` +If l_shipdate is the partition field of the base table lineitem, the following materialized view can be incrementally +updated by partition. + +```sql +CREATE MATERIALIZED VIEW mv9 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +The following materialized view cannot be incrementally updated by partition because l_shipdate is generated from the +right side of a LEFT OUTER JOIN and may produce null values. + +```sql +CREATE MATERIALIZED VIEW mv10 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +### 5. The materialized view returns no data upon direct query? + The materialized view might still be under construction or the construction process might have failed. Use the following +statement to check the status of materialized view construction: + +```sql + -- 查看物化视图元数据信息,database 为当前数据库, mv_name 为物化视图名称 + select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` + +```sql +-- 查看任务元数据 +select * from jobs("type"="mv") order by CreateTime limit 5; +``` +```sql +-- 查看任务执行信息,这里面会展示任务执行的状态,如果失败会有失败原因 +select * from tasks("type"="mv") where JobName = 'job_name'; +``` + +### 6. What happens when the data in the base tables used by the materialized view changes before the materialized view is refreshed? +The timeliness of data in asynchronous materialized views has a certain delay compared to the base tables. + +For internal tables and external tables that can perceive data changes (such as Hive tables), +whether a materialized view can be used for transparent rewriting when the data in the base tables changes +before the materialized view is refreshed depends on the threshold set by `grace_period`. + +`grace_period` refers to the time allowance for the materialized view to be inconsistent with the data from the base tables. + +For example, if `grace_period` is set to 0, it means that the materialized view must be consistent with the data in the base +tables before it can be used for transparent rewriting. + +For external tables (except Hive tables), since data changes cannot be perceived, the materialized view can be used for transparent rewriting regardless of whether the data in the external tables is up-to-date (in this case, data inconsistency may occur). + +If `grace_period` is set to 10, it means that the materialized view and the base table data are allowed to have a delay of 10 seconds. If there is a delay in the data of the materialized view compared to the base tables within 10 seconds, the materialized view can still be used for transparent rewriting. + +For partitioned materialized views, if some partitions become invalid, there are two scenarios: + +a. If the query does not use data from the invalid partitions, the materialized view can still be used for transparent rewriting. + +b. If the query uses data from the invalid partitions and the data is within the grace_period, the materialized view can still be used. If the data in the materialized view is not within the grace_period, the query can be responded to by using UNION ALL with the original table and the materialized view. + +### 7. How to confirm if the materialized view is hit, and how to check the reason if it's not hit? + +You can use the `explain query_sql` command to see a summary of whether the materialized view is hit or not. For example, +consider the following materialized view: + +```sql +CREATE MATERIALIZED VIEW mv11 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +Now, let's analyze the following query: + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +The explain command will show information about whether the materialized view is hit or not. +If it's not hit, it will provide a summary of the failure reason. For example: +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, The graph logic between query and view is not consistent +``` + +In this case, the failure reason is `The graph logic between query and view is not consistent`, +which means that the join order in the query is not consistent with the join order in the materialized view. + +Let's consider another query: + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +If this query also fails to hit the materialized view, the summary might be: + +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, View dimensions doesn't not cover the query dimensions +``` + +In this case, the failure reason is `View dimensions doesn't not cover the query dimensions`, indicating that +the fields used in the `GROUP BY` clause of the query cannot be obtained from the `GROUP BY` clause +of the materialized view. \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/docs/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index f1fd88afd57..9121b07d6b5 100644 --- a/docs/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/docs/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -101,15 +101,14 @@ AUTO:Try to refresh incrementally as much as possible. If incremental refresh The SQL definition and partition fields of the materialized view need to meet the following conditions for partition incremental updates: -- At least one of the base tables used by the materialized view must be a partitioned table. -- The partitioned tables used by the materialized view must employ list or range partitioning strategies. -- The top-level partition column in the materialized view can only have one partition field. -- The SQL of the materialized view needs to use partition columns from the base table, such as after the SELECT clause. -- If GROUP BY is used, the partition column fields must be after the GROUP BY. -- If window functions are used, the partition column fields must be after the PARTITION BY. -- Data changes should occur on partitioned tables. If they occur on non-partitioned tables, the materialized view needs to be fully rebuilt. -- Using the fields that generate nulls in the JOIN as partition fields in the materialized view prohibits partition incremental updates. -- The null attribute of partition fields in the base table used by the materialized view must not be empty if it comes from an internal table. If it comes from an external table in Hive, the null attribute of the base table can be empty. +- At least one of the base tables used by the materialized view is a partitioned table. +- The base tables used by the materialized view must use list or range partitioning strategies. +- The `partition by` clause in the SQL definition of the materialized view can only have one partitioning column. +- The partitioning column specified in the `partition by` clause of the materialized view's SQL must come after the `SELECT` statement. +- If the materialized view's SQL includes a `group by` clause, the columns used for partitioning must come after the `group by` clause. +- If the materialized view's SQL includes window functions, the columns used for partitioning must come after the `partition by` clause. +- Data changes should occur on partitioned tables. If data changes occur on non-partitioned tables, the materialized view needs to be fully rebuilt. +- If a field from the null-generating side of a join is used as a partitioning column for the materialized view, it cannot be incrementally updated by partition. For example, for a LEFT OUTER JOIN, the partitioning column must be on the left side, not the right. ```sql refreshMethod diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/view-materialized-view/query-async-materialized-view.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/view-materialized-view/query-async-materialized-view.md index d42d796a4e8..b636d78810c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/view-materialized-view/query-async-materialized-view.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query/view-materialized-view/query-async-materialized-view.md @@ -35,11 +35,11 @@ Doris 可以分析查询 SQL 的结构信息,自动寻找满足要求的物化 表的定义如下: ```sql CREATE TABLE IF NOT EXISTS lineitem ( - l_orderkey integer not null, - l_partkey integer not null, - l_suppkey integer not null, - l_linenumber integer not null, - l_quantity decimalv3(15,2) not null, + l_orderkey integer not null, + l_partkey integer not null, + l_suppkey integer not null, + l_linenumber integer not null, + l_quantity decimalv3(15,2) not null, l_extendedprice decimalv3(15,2) not null, l_discount decimalv3(15,2) not null, l_tax decimalv3(15,2) not null, @@ -54,7 +54,7 @@ CREATE TABLE IF NOT EXISTS lineitem ( ) DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) PARTITION BY RANGE(l_shipdate) -(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) +(FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -77,7 +77,7 @@ CREATE TABLE IF NOT EXISTS orders ( ) DUPLICATE KEY(o_orderkey, o_custkey) PARTITION BY RANGE(o_orderdate)( - FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) + FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -325,15 +325,16 @@ l_suppkey; 暂时目前支持的聚合上卷函数列表如下: -| 查询中函数 | 物化视图中函数 | 函数上卷后 | -|--------------------|---------------|---------------------| -| max | max | max | -| min | min | min | -| sum | sum | sum | -| count | count | sum | -| count(distinct ) | bitmap_union | bitmap_union_count | -| bitmap_union | bitmap_union | bitmap_union | -| bitmap_union_count | bitmap_union | bitmap_union_count | +| 查询中函数 | 物化视图中函数 | 函数上卷后 | +|-------------------------------------------------------|-----------------------------|--------------------| +| max | max | max | +| min | min | min | +| sum | sum | sum | +| count | count | sum | +| count(distinct ) | bitmap_union | bitmap_union_count | +| bitmap_union | bitmap_union | bitmap_union | +| bitmap_union_count | bitmap_union | bitmap_union_count | +| hll_union_agg, approx_count_distinct, hll_cardinality | hll_union 或者 hll_raw_agg | hll_union_agg | ## Query partial 透明改写(Coming soon) 当物化视图的表比查询多时,如果物化视图比查询多的表满足 JOIN 消除的条件,那么也可以进行透明改写,如下可以进行透明改写,待支持。 @@ -367,52 +368,128 @@ mv 定义: LEFT OUTER JOIN orders ON L_ORDERKEY = O_ORDERKEY; ``` -## Union 改写(Coming soon) -当物化视图不足以提供查询的所有数据时,可以通过 Union 的方式,将查询原表和物化视图 Union 起来返回数据,如下可以进行透明改写,待支持。 +## Union 改写 +当物化视图不足以提供查询的所有数据时,可以使用 union all 的方式,将查询原表和物化视图的数据结合作为最终返回结果。 +目前需要物化视图是分区物化视图,可以对分区字段的过滤条件使用 union all 补全数据。 **用例 1** mv 定义: ```sql CREATE MATERIALIZED VIEW mv7 -BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 hour -DISTRIBUTED BY RANDOM BUCKETS 3 -PROPERTIES ('replication_num' = '1') -AS -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 10; +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +as +select l_shipdate, o_orderdate, l_partkey, + l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` -查询语句: +当基表新增分区 `2023-10-21` 时,并且物化视图还未刷新时,可以通过物化视图 union all 原表的方式返回结果 + ```sql -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5; +insert into lineitem values + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-21', '2023-10-21', '2023-10-21', 'a', 'b', 'yyyyyyyyy'); +``` + +运行查询语句: +```sql +select l_shipdate, o_orderdate, l_partkey, l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` 改写结果示意: ```sql SELECT * -FROM mv -UNION ALL -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5 AND o_orderkey <= 10; +FROM mv7 +union all +select t1.l_shipdate, o_orderdate, t1.l_partkey, t1.l_suppkey, sum(o_totalprice) as sum_total +from (select * from lineitem where l_shipdate = '2023-10-21') t1 + left join orders on t1.l_orderkey = orders.o_orderkey and t1.l_shipdate = o_orderdate +group by + t1.l_shipdate, + o_orderdate, + t1.l_partkey, + t1.l_suppkey; +``` + +注意: +物化视图带 where 条件,以上述为例,如果构建物化的过滤条件加上 `where l_shipdate > '2023-10-19'` 查询是 `where l_shipdate > '2023-10-18'` +目前这种还无法通过 union 补偿,待支持 + + +## 嵌套物化视图改写 +物化视图的定义SQL可以使用物化视图,此物化视图称为嵌套物化视图,嵌套的层数理论上没有限制,此物化视图可以直查,也可以进行透明改写。 +嵌套物化视图也可以参与透明改写。 + +**用例 1** + +首先创建内层物化视图 `mv8_0_inner_mv` +```sql +CREATE MATERIALIZED VIEW mv8_0_inner_mv +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey +from lineitem +inner join orders on lineitem.l_orderkey = orders.o_orderkey; ``` +创建外层物化视图 `mv8_0` +```sql +CREATE MATERIALIZED VIEW mv8_0 +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey, +ps_availqty +from mv8_0_inner_mv +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey; +``` +如下查询, `mv8_0_inner_mv` 和 `mv8_0` 都会改写成功,最终代价模型会选择 `mv8_0` +```sql +select lineitem.l_linenumber +from lineitem +inner join orders on l_orderkey = o_orderkey +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey +where o_orderstatus = 'o' +``` + +**注意:** +1. 嵌套物化视图的层数越多,透明改写的耗时会相应增加,建议嵌套物化视图层数不要超过3层。 +2. 嵌套物化视图透明改写默认关闭,开启方式见下面开关。 + + ## 辅助功能 **透明改写后数据一致性问题** @@ -460,22 +537,255 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10; ## 相关环境变量 -| 开关 | 说明 | -|---------------------------------------------------------------------|-----------------------------------| -| SET enable_nereids_planner = true; | 异步物化视图只有在新优化器下才支持,所以需要开启新优化器 | -| SET enable_materialized_view_rewrite = true; | 开启或者关闭查询透明改写,默认关闭 | -| SET materialized_view_rewrite_enable_contain_external_table = true; | 参与透明改写的物化视图是否允许包含外表,默认不允许 | -| SET materialized_view_rewrite_success_candidate_num = 3; | 透明改写成功的结果集合,允许参与到 CBO 候选的最大数量,默认是 3 个 | - +| 开关 | 说明 | +|---------------------------------------------------------------------|------------------------------------------------------------------------------------------------------| +| SET enable_nereids_planner = true; | 异步物化视图只有在新优化器下才支持,所以需要开启新优化器 | +| SET enable_materialized_view_rewrite = true; | 开启或者关闭查询透明改写,默认关闭 | +| SET materialized_view_rewrite_enable_contain_external_table = true; | 参与透明改写的物化视图是否允许包含外表,默认不允许 | +| SET materialized_view_rewrite_success_candidate_num = 3; | 透明改写成功的结果集合,允许参与到 CBO 候选的最大数量,默认是 3 | +| SET enable_materialized_view_union_rewrite = true; | 当分区物化视图不足以提供查询的全部数据时,是否允许基表和物化视图 union all 来响应查询,默认允许 | +| SET enable_materialized_view_nest_rewrite = true; | 是否允许嵌套改写,默认不允许 | +| SET materialized_view_relation_mapping_max_count = 8; | 透明改写过程中,relation mapping最大允许数量,如果超过,进行截取。relation mapping通常由表自关联产生,数量一般会是笛卡尔积,比如3张表,可能会产生 8 种组合。默认是 8 | ## 限制 - 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,JOIN 的输入可以包含简单的 GROUP BY(单表聚合),其中 JOIN 的支持的类型为 - INNER 和 LEFT OUTER JOIN 其他类型的 JOIN 操作逐步支持。 + INNER,LEFT OUTER JOIN,RIGHT OUTER JOIN,FULL OUTER JOIN, LEFT SEMI JOIN,RIGHT SEMI JOIN,LEFT ANTI JOIN,RIGHT ANTI JOIN。 - 基于 External Table 的物化视图不保证查询结果强一致。 - 不支持使用非确定性函数来构建物化视图,包括 rand、now、current_time、current_date、random、uuid 等。 - 不支持窗口函数的透明改写。 -- 查询和物化视图中有 LIMIT,暂时不支持透明改写。 -- 物化视图的定义暂时不能使用视图和物化视图。 +- 物化视图中有 LIMIT,暂时不支持透明改写。 - 当查询或者物化视图没有数据时,不支持透明改写。 -- 目前 WHERE 条件补偿,支持物化视图没有 WHERE,查询有 WHERE 情况的条件补偿;或者物化视图有 WHERE 且查询的 WHERE 条件是物化视图的超集。 - 目前暂时还不支持范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10,逐步支持。 \ No newline at end of file +- 目前 WHERE 条件补偿,只支持列为数值和日期类型的条件范围补偿,比如物化视图定义是 a > 5,查询是 a > 10支持透明改写。 + + +# 常见问题 +## 1. 物化视图没有命中是为什么? +确定物化视图是否命中需要执行如下SQL +```sql explain your_query_sql;``` + +a. 物化视图透明改写功能默认是关闭的,需要打开对应开关才可以改写,开关值见 异步物化视图相关开关 + +b. 可能物化视图不可用,导致透明改写不能命中,查看物化视图构建状态见问题2 + +c. 经过前两步的检查,如果物化视图还是不能命中,可能物化视图的定义SQL和查询SQL不在当前物化视图改写能力的范围内,见物化视图透明改写能力 + + +## 2. 怎么查看物化状态是否正常? +### 2.1 确认物化视图构建状态 +物化视图的状态是Success,才可以参与透明改写,首先运行 +```sql +select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` +查看物化视图的 JobName。 +其次根据JobName查看物化视图的任务状态,运行如下语句 +```sql +select * from tasks("type"="mv") where JobName = 'job_name'; +``` +查看最近执行的任务状态 `Status` 是否是 `Success` + +### 2.2 确认物化视图数据一致的可用性 +物化视图构建成功,但是因为数据变更,和 `grace_period` 的设置导致物化视图不可用。 +查看物化视图数据一致性的方法 +* 全量构建的物化视图 + 运行如下sql,查看字段 `SyncWithBaseTables` 是否是 1 +```sql +select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` +* 分区构建的物化视图 + 运行如下sql,查看查询使用的分区是否有效 +```sql +show partitions from mv_name; +``` + + +## 3. 构建物化时报错 +报错信息 +` ERROR 1105 (HY000): errCode = 2, detailMessage = Syntax error in line 1: +BUILD IMMEDIATE REFRESH AUTO ON MANUAL` + +1. 异步物化视图的语句,在新优化器下才支持,确保使用的是新优化器 + `SET global enable_nereids_planner = true;` +2. 可能是构建物化的语句使用的关键词写错或者物化定义SQL的语法有问题,可以检查下物化定义SQL和创建物化语句是否正确。 + +## 4. 构建分区物化视图报 Unable to find a suitable base table for partitioning +报这个错,通常指的是物化视图的SQL定义和物化视图分区字段的选择,导致不能分区增量更新,所以创建分区物化视图会报这个错。 +物化视图想要分区增量更新,需要满足以下要求,详情见 [CREATE ASYNC MATERIALIZED VIEW](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md) + +满足分区物化视图构建,举例如下: + +```sql + CREATE TABLE IF NOT EXISTS lineitem ( + l_orderkey INTEGER NOT NULL, + l_partkey INTEGER NOT NULL, + l_suppkey INTEGER NOT NULL, + l_linenumber INTEGER NOT NULL, + l_quantity DECIMALV3(15,2) NOT NULL, + l_extendedprice DECIMALV3(15,2) NOT NULL, + l_discount DECIMALV3(15,2) NOT NULL, + l_tax DECIMALV3(15,2) NOT NULL, + l_returnflag CHAR(1) NOT NULL, + l_linestatus CHAR(1) NOT NULL, + l_shipdate DATE NOT NULL, + l_commitdate DATE NOT NULL, + l_receiptdate DATE NOT NULL, + l_shipinstruct CHAR(25) NOT NULL, + l_shipmode CHAR(10) NOT NULL, + l_comment VARCHAR(44) NOT NULL + ) + DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) + PARTITION BY RANGE(l_shipdate) ( + PARTITION `day_1` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_2` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_3` VALUES LESS THAN ("2023-12-30")) + DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +```sql + CREATE TABLE IF NOT EXISTS orders ( + o_orderkey INTEGER NOT NULL, + o_custkey INTEGER NOT NULL, + o_orderstatus CHAR(1) NOT NULL, + o_totalprice DECIMALV3(15,2) NOT NULL, + o_orderdate DATE NOT NULL, + o_orderpriority CHAR(15) NOT NULL, + o_clerk CHAR(15) NOT NULL, + o_shippriority INTEGER NOT NULL, + O_COMMENT VARCHAR(79) NOT NULL + ) + DUPLICATE KEY(o_orderkey, o_custkey) + PARTITION BY RANGE(o_orderdate) ( + PARTITION `day_2` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_3` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_4` VALUES LESS THAN ("2023-12-30") + ) + DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +物化视图定义如下,如果 `l_shipdate` 是基表 `lineitem` 的分区字段,如下的物化视图是可以进行分区增量更新的 + +```sql +CREATE MATERIALIZED VIEW mv9 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` +如下的物化视图是不可以进行分区增量更新的,因为 `l_shipdate` 来自 `LEFT OUTER JOIN` 的右侧 null 产生端。 + +```sql +CREATE MATERIALIZED VIEW mv10 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + + +## 5. 直查物化物化视图没有数据? +可能物化在构建中,也有可能物化构建失败了。 通过如下语句查看物化构建的状态 + +```sql + -- 查看物化视图元数据信息,database 为当前数据库, mv_name 为物化视图名称 + select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` + +```sql +-- 查看任务元数据 +select * from jobs("type"="mv") order by CreateTime limit 5; +``` +```sql +-- 查看任务执行信息,这里面会展示任务执行的状态,如果失败会有失败原因 +select * from tasks("type"="mv") where JobName = 'job_name'; +``` + +## 6. 物化视图使用的基表数据变了,但是此时物化视图还没有刷新,透明改写的行为是? +异步物化视图的数据时效性和基表是有一定时延的。 +对于内表和可以感知数据变化的外表(比如hive),当基表的数据变更时,此物化视图是否可用于透明改写是通过 `grace_period` 的阈值来决定的。 +`grace_period` 指的是容许物化视图和所用基表数据不一致的时间。 + +比如 grace_period 设置成 0,意味要求物化视图和基表数据保持一致,此物化视图才可用于透明改写; +对于外表(除 hive 外),因为无法感知数据变更,所以物化视图使用了外表,无论外表的数据是不是最新的,都可以使用此物化视图用于透明改写(此种情况数据会不一致)。 + +如果设置成 10,意味物化视图和基表数据允许 10s 的延迟,如果物化视图的数据和基表的数据有延迟,如果在 10s 内,此物化视图都可以用于透明改写。 + +如果物化视图是分区物化视图,如果部分分区失效。有如下两种情况 +1. 查询没有使用失效的分区数据,那么此物化视图依然可用于透明改写。 +2. 查询使用了失效分区的数据,并且数据时效在 `grace_period` 范围内,那么此物化视图依然可用。如果物化视图数据时效不在 `grace_period` 范围内。 + 可以通过 union all 原表和物化视图来响应查询。 + +## 7. 怎么确认是否命中,如果不命中怎么查看原因? +可以通过 `explain query_sql` 的方式查看是否命中和不命中的摘要信息,例如如下物化视图 +```sql +CREATE MATERIALIZED VIEW mv11 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, count(*) +FROM lineitem + LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +查询如下 + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM orders + LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +Explain 显示信息可以看到 `MaterializedViewRewriteFail` 有失败的摘要信息, +`The graph logic between query and view is not consistent` 表示查询和物化join的逻辑不一致,上述查询和物化 join的表顺序不一致所以会报这个错。 +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, The graph logic between query and view is not consistent +``` + +来看另一个查询 +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM lineitem + LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` +Explain 显示信息如下 +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, View dimensions doesn't not cover the query dimensions +``` +失败的摘要信息为 `View dimensions doesn't not cover the query dimensions`,表示查询中 `group by` 的字段不能从物化 `group by` 中获取,会报这个错。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index f9f2ba2a896..e4285372676 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -26,11 +26,11 @@ under the License. ## CREATE-ASYNC-MATERIALIZED-VIEW -### Name +### 名称 CREATE ASYNC MATERIALIZED VIEW -### Description +### 描述 该语句用于创建异步物化视图。 @@ -50,9 +50,9 @@ CREATE MATERIALIZED VIEW (IF NOT EXISTS)? mvName=multipartIdentifier #### 说明 -##### simpleColumnDefs +`simpleColumnDefs` -用来定义物化视图column信息,如果不定义,将自动推导 +用来定义物化视图 Column 信息,如果不定义,将自动推导。 ```sql simpleColumnDefs @@ -64,15 +64,15 @@ simpleColumnDef ; ``` -例如:定义两列aa和bb,其中aa的注释为"name" +例如:定义两列 aa 和 bb,其中 aa 的注释为"name" ```sql CREATE MATERIALIZED VIEW mv1 (aa comment "name",bb) ``` -##### buildMode +`buildMode` -用来定义物化视图是否创建完成立即刷新,默认IMMEDIATE +用来定义物化视图是否创建完成立即刷新,默认 IMMEDIATE IMMEDIATE:立即刷新 @@ -91,25 +91,25 @@ CREATE MATERIALIZED VIEW mv1 BUILD IMMEDIATE ``` -##### refreshMethod +`refreshMethod` -用来定义物化视图刷新方式,默认AUTO +用来定义物化视图刷新方式,默认 AUTO COMPLETE:全量刷新 AUTO:尽量增量刷新,如果不能分区增量刷新,就全量刷新 -物化视图的 SQL 定义和分区字段需要满足如下条件才可以进行分区增量更新 +物化视图的 SQL 定义和分区字段需要满足如下条件,才可以进行分区增量更新: -- 物化视图使用的 base table 中至少有一个是分区表。 -- 物化视图使用的分区表,必须使用 list 或者 range 分区策略。 -- 物化视图最顶层的分区列只能有一个分区字段。 -- 物化视图的 SQL 中需要使用了 base table 中的分区列,比如在 Select 后。 -- 如果使用了group by,分区列的字段一定要在 group by 后。 -- 如果使用了 window 函数,分区列的字段一定要在partition by后。 +- 物化视图使用的 Base Table 中至少有一个是分区表。 +- 物化视图使用的 Base Table 分区表,必须使用 List 或者 Range 分区策略。 +- 物化视图定义 SQL 中 Partition By 分区列只能有一个分区字段。 +- 物化视图的 SQL 中 Partition By 的分区列,要在 Select 后。 +- 物化视图定义 SQL,如果使用了 Group By,分区列的字段一定要在 Group By 后。 +- 物化视图定义 SQL,如果使用了 Window 函数,分区列的字段一定要在 Partition By 后。 - 数据变更应发生在分区表上,如果发生在非分区表,物化视图需要全量构建。 -- 物化视图使用 Join 的 null 产生端的字段作为分区字段,不能分区增量更新。 -- 物化视图使用的 base table 表分区字段,如果来源于内表,base table 的 null 属性不能为空,如果来源于 hive 外表,base table 的 null 属性可以为空。 +- 物化视图使用 Join 的 NULL 产生端的字段作为分区字段,不能分区增量更新,例如对于 LEFT OUTER JOIN 分区字段需要在左侧,在右侧则不行。 + ```sql refreshMethod @@ -123,9 +123,9 @@ CREATE MATERIALIZED VIEW mv1 REFRESH COMPLETE ``` -##### refreshTrigger +`refreshTrigger` -物化视图刷新数据的触发方式,默认MANUAL +物化视图刷新数据的触发方式,默认 MANUAL MANUAL:手动刷新 @@ -152,8 +152,8 @@ CREATE MATERIALIZED VIEW mv1 REFRESH ON SCHEDULE EVERY 2 HOUR STARTS "2023-12-13 21:07:09" ``` -##### key -物化视图为DUPLICATE KEY模型,因此指定的列为排序列 +`key` +物化视图为 Duplicate Key 模型,因此指定的列为排序列 ```sql identifierList @@ -165,42 +165,41 @@ identifierSeq ; ``` -例如:指定k1,k2为排序列 +例如:指定k1,k2为排序列 ```sql CREATE MATERIALIZED VIEW mv1 KEY(k1,k2) ``` -##### partition -物化视图有两种分区方式,如果不指定分区,默认只有一个分区,如果指定分区字段,会自动推导出字段来自哪个基表并同步基表(当前支持`OlapTable`和`hive`)的所有分区(限制条件:基表如果是`OlapTable`,那么只能有一个分区字段) +`partition` +物化视图有两种分区方式,如果不指定分区,默认只有一个分区,如果指定分区字段,会自动推导出字段来自哪个基表并同步基表(当前支持 `OlapTable` 和 `hive`)的所有分区(限制条件:基表如果是 `OlapTable`,那么只能有一个分区字段)。 -例如:基表是range分区,分区字段为`create_time`并按天分区,创建物化视图时指定`partition by(ct) as select create_time as ct from t1` -那么物化视图也会是range分区,分区字段为`ct`,并且按天分区 +例如:基表是 Range 分区,分区字段为 `create_time` 并按天分区,创建物化视图时指定 `partition by(ct) as select create_time as ct from t1`,那么物化视图也会是 Range 分区,分区字段为 `ct`,并且按天分区。 -分区字段的选择和物化视图的定义需要满足分区增量更新的条件,物化视图才可以创建成功,否则会报错 `Unable to find a suitable base table for partitioning` +分区字段的选择和物化视图的定义需要满足分区增量更新的条件,物化视图才可以创建成功,否则会报错 `Unable to find a suitable base table for partitioning`。 -#### property -物化视图既可以指定table的property,也可以指定物化视图特有的property。 +#### Property +物化视图既可以指定 Table 的 Property,也可以指定物化视图特有的 Property。 -物化视图特有的property包括: +物化视图特有的 Property 包括: -`grace_period`:查询改写时允许物化视图数据的最大延迟时间(单位:秒)。如果分区A和基表的数据不一致,物化视图的分区A上次刷新时间为1,系统当前时间为2,那么该分区不会被透明改写。但是如果grace_period大于等于1,该分区就会被用于透明改写 +`grace_period`:查询改写时允许物化视图数据的最大延迟时间(单位:秒)。如果分区 A 和基表的数据不一致,物化视图的分区 A 上次刷新时间为 1,系统当前时间为 2,那么该分区不会被透明改写。但是如果 `grace_period` 大于等于1,该分区就会被用于透明改写。 `excluded_trigger_tables`:数据刷新时忽略的表名,逗号分割。例如`table1,table2` -`refresh_partition_num`:单次insert语句刷新的分区数量,默认为1。物化视图刷新时会先计算要刷新的分区列表,然后根据该配置拆分成多个insert语句顺序执行。遇到失败的insert语句,整个任务将停止执行。物化视图保证单个insert语句的事务性,失败的insert语句不会影响到已经刷新成功的分区。 +`refresh_partition_num`:单次 Insert 语句刷新的分区数量,默认为 1。物化视图刷新时会先计算要刷新的分区列表,然后根据该配置拆分成多个 Insert 语句顺序执行。遇到失败的 Insert 语句,整个任务将停止执行。物化视图保证单个 Insert 语句的事务性,失败的 Insert 语句不会影响到已经刷新成功的分区。 -`workload_group`:物化视图执行刷新任务时使用的workload_group名称。用来限制物化视图刷新数据使用的资源,避免影响到其它业务的运行。workload_group创建及使用 [WORKLOAD-GROUP](../../../../admin-manual/workload-group.md) +`workload_group`:物化视图执行刷新任务时使用的 `workload_group` 名称。用来限制物化视图刷新数据使用的资源,避免影响到其它业务的运行。关于 `workload_group` 的创建及使用,可参考 [Workload Group](../../../../admin-manual/workload-group.md) 文档。 -`partition_sync_limit`:当基表的分区字段为时间时(如果是字符串类型的时间,可以设置partition_date_format),可以用此属性配置同步基表的分区范围,配合`partition_sync_time_unit`一起使用。 -例如设置为 2,`partition_sync_time_unit`设置为 MONTH,代表仅同步基表近2个月的分区和数据。最小值为1。 -随着时间的变化物化视图每次刷新时都会自动增删分区,例如物化视图现在有2,3两个月的数据,下个月的时候,会自动删除2月的数据,增加4月的数据。 +`partition_sync_limit`:当基表的分区字段为时间时(如果是字符串类型的时间,可以设置 `partition_date_format`),可以用此属性配置同步基表的分区范围,配合 `partition_sync_time_unit` 一起使用。 +例如设置为 2,`partition_sync_time_unit` 设置为 `MONTH`,代表仅同步基表近 2 个月的分区和数据。最小值为 `1`。 +随着时间的变化物化视图每次刷新时都会自动增删分区,例如物化视图现在有 2,3 两个月的数据,下个月的时候,会自动删除 2 月的数据,增加 4 月的数据。 `partition_sync_time_unit`:时间单位,支持 DAY/MONTH/YEAR(默认DAY) `partition_date_format`:分区字段的时间格式,例如"%Y-%m-%d" -##### query +`query` 创建物化视图的查询语句,其结果即为物化视图中的数据 @@ -209,9 +208,9 @@ KEY(k1,k2) SELECT random() as dd,k3 FROM user ``` -### Example +### 示例 -1. 创建一个立即刷新,之后每周刷新一次的物化视图mv1,数据源为hive catalog +1. 创建一个立即刷新,之后每周刷新一次的物化视图 `mv1`,数据源为 Hive Catalog ```sql CREATE MATERIALIZED VIEW mv1 BUILD IMMEDIATE REFRESH COMPLETE ON SCHEDULE EVERY 1 WEEK @@ -222,7 +221,7 @@ SELECT random() as dd,k3 FROM user AS SELECT * FROM hive_catalog.db1.user; ``` -2. 创建一个多表join的物化视图 +2. 创建一个多表 Join 的物化视图 ```sql CREATE MATERIALIZED VIEW mv1 BUILD IMMEDIATE REFRESH COMPLETE ON SCHEDULE EVERY 1 WEEK @@ -233,7 +232,6 @@ SELECT random() as dd,k3 FROM user AS select user.k1,user.k3,com.k4 from user join com on user.k1=com.k1; ``` -### Keywords +### 关键词 CREATE, ASYNC, MATERIALIZED, VIEW - diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md index d42d796a4e8..183c62e8c29 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md @@ -54,7 +54,7 @@ CREATE TABLE IF NOT EXISTS lineitem ( ) DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) PARTITION BY RANGE(l_shipdate) -(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) +(FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -77,7 +77,7 @@ CREATE TABLE IF NOT EXISTS orders ( ) DUPLICATE KEY(o_orderkey, o_custkey) PARTITION BY RANGE(o_orderdate)( - FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) + FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -325,15 +325,16 @@ l_suppkey; 暂时目前支持的聚合上卷函数列表如下: -| 查询中函数 | 物化视图中函数 | 函数上卷后 | -|--------------------|---------------|---------------------| -| max | max | max | -| min | min | min | -| sum | sum | sum | -| count | count | sum | -| count(distinct ) | bitmap_union | bitmap_union_count | -| bitmap_union | bitmap_union | bitmap_union | -| bitmap_union_count | bitmap_union | bitmap_union_count | +| 查询中函数 | 物化视图中函数 | 函数上卷后 | +|-------------------------------------------------------|-----------------------------|--------------------| +| max | max | max | +| min | min | min | +| sum | sum | sum | +| count | count | sum | +| count(distinct ) | bitmap_union | bitmap_union_count | +| bitmap_union | bitmap_union | bitmap_union | +| bitmap_union_count | bitmap_union | bitmap_union_count | +| hll_union_agg, approx_count_distinct, hll_cardinality | hll_union 或者 hll_raw_agg | hll_union_agg | ## Query partial 透明改写(Coming soon) 当物化视图的表比查询多时,如果物化视图比查询多的表满足 JOIN 消除的条件,那么也可以进行透明改写,如下可以进行透明改写,待支持。 @@ -367,52 +368,128 @@ mv 定义: LEFT OUTER JOIN orders ON L_ORDERKEY = O_ORDERKEY; ``` -## Union 改写(Coming soon) -当物化视图不足以提供查询的所有数据时,可以通过 Union 的方式,将查询原表和物化视图 Union 起来返回数据,如下可以进行透明改写,待支持。 +## Union 改写 +当物化视图不足以提供查询的所有数据时,可以使用 union all 的方式,将查询原表和物化视图的数据结合作为最终返回结果。 +目前需要物化视图是分区物化视图,可以对分区字段的过滤条件使用 union all 补全数据。 **用例 1** mv 定义: ```sql CREATE MATERIALIZED VIEW mv7 -BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 hour -DISTRIBUTED BY RANDOM BUCKETS 3 -PROPERTIES ('replication_num' = '1') -AS -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 10; +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +as +select l_shipdate, o_orderdate, l_partkey, + l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` -查询语句: +当基表新增分区 `2023-10-21` 时,并且物化视图还未刷新时,可以通过物化视图 union all 原表的方式返回结果 + ```sql -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5; +insert into lineitem values + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-21', '2023-10-21', '2023-10-21', 'a', 'b', 'yyyyyyyyy'); +``` + +运行查询语句: +```sql +select l_shipdate, o_orderdate, l_partkey, l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` 改写结果示意: ```sql SELECT * -FROM mv -UNION ALL -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5 AND o_orderkey <= 10; +FROM mv7 +union all +select t1.l_shipdate, o_orderdate, t1.l_partkey, t1.l_suppkey, sum(o_totalprice) as sum_total +from (select * from lineitem where l_shipdate = '2023-10-21') t1 + left join orders on t1.l_orderkey = orders.o_orderkey and t1.l_shipdate = o_orderdate +group by + t1.l_shipdate, + o_orderdate, + t1.l_partkey, + t1.l_suppkey; ``` +注意: +物化视图带 where 条件,以上述为例,如果构建物化的过滤条件加上 `where l_shipdate > '2023-10-19'` 查询是 `where l_shipdate > '2023-10-18'` +目前这种还无法通过 union 补偿,待支持 + + +## 嵌套物化视图改写 +物化视图的定义SQL可以使用物化视图,此物化视图称为嵌套物化视图,嵌套的层数理论上没有限制,此物化视图可以直查,也可以进行透明改写。 +嵌套物化视图也可以参与透明改写。 + +**用例 1** + +首先创建内层物化视图 `mv8_0_inner_mv` +```sql +CREATE MATERIALIZED VIEW mv8_0_inner_mv +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey +from lineitem +inner join orders on lineitem.l_orderkey = orders.o_orderkey; +``` + +创建外层物化视图 `mv8_0` +```sql +CREATE MATERIALIZED VIEW mv8_0 +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey, +ps_availqty +from mv8_0_inner_mv +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey; +``` +如下查询, `mv8_0_inner_mv` 和 `mv8_0` 都会改写成功,最终代价模型会选择 `mv8_0` +```sql +select lineitem.l_linenumber +from lineitem +inner join orders on l_orderkey = o_orderkey +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey +where o_orderstatus = 'o' +``` + +**注意:** +1. 嵌套物化视图的层数越多,透明改写的耗时会相应增加,建议嵌套物化视图层数不要超过3层。 +2. 嵌套物化视图透明改写默认关闭,开启方式见下面开关。 + + ## 辅助功能 **透明改写后数据一致性问题** @@ -460,22 +537,255 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10; ## 相关环境变量 -| 开关 | 说明 | -|---------------------------------------------------------------------|-----------------------------------| -| SET enable_nereids_planner = true; | 异步物化视图只有在新优化器下才支持,所以需要开启新优化器 | -| SET enable_materialized_view_rewrite = true; | 开启或者关闭查询透明改写,默认关闭 | -| SET materialized_view_rewrite_enable_contain_external_table = true; | 参与透明改写的物化视图是否允许包含外表,默认不允许 | -| SET materialized_view_rewrite_success_candidate_num = 3; | 透明改写成功的结果集合,允许参与到 CBO 候选的最大数量,默认是 3 个 | - +| 开关 | 说明 | +|---------------------------------------------------------------------|------------------------------------------------------------------------------------------------------| +| SET enable_nereids_planner = true; | 异步物化视图只有在新优化器下才支持,所以需要开启新优化器 | +| SET enable_materialized_view_rewrite = true; | 开启或者关闭查询透明改写,默认关闭 | +| SET materialized_view_rewrite_enable_contain_external_table = true; | 参与透明改写的物化视图是否允许包含外表,默认不允许 | +| SET materialized_view_rewrite_success_candidate_num = 3; | 透明改写成功的结果集合,允许参与到 CBO 候选的最大数量,默认是 3 | +| SET enable_materialized_view_union_rewrite = true; | 当分区物化视图不足以提供查询的全部数据时,是否允许基表和物化视图 union all 来响应查询,默认允许 | +| SET enable_materialized_view_nest_rewrite = true; | 是否允许嵌套改写,默认不允许 | +| SET materialized_view_relation_mapping_max_count = 8; | 透明改写过程中,relation mapping最大允许数量,如果超过,进行截取。relation mapping通常由表自关联产生,数量一般会是笛卡尔积,比如3张表,可能会产生 8 种组合。默认是 8 | ## 限制 - 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,JOIN 的输入可以包含简单的 GROUP BY(单表聚合),其中 JOIN 的支持的类型为 - INNER 和 LEFT OUTER JOIN 其他类型的 JOIN 操作逐步支持。 + INNER,LEFT OUTER JOIN,RIGHT OUTER JOIN,FULL OUTER JOIN, LEFT SEMI JOIN,RIGHT SEMI JOIN,LEFT ANTI JOIN,RIGHT ANTI JOIN。 - 基于 External Table 的物化视图不保证查询结果强一致。 - 不支持使用非确定性函数来构建物化视图,包括 rand、now、current_time、current_date、random、uuid 等。 - 不支持窗口函数的透明改写。 -- 查询和物化视图中有 LIMIT,暂时不支持透明改写。 -- 物化视图的定义暂时不能使用视图和物化视图。 +- 物化视图中有 LIMIT,暂时不支持透明改写。 - 当查询或者物化视图没有数据时,不支持透明改写。 -- 目前 WHERE 条件补偿,支持物化视图没有 WHERE,查询有 WHERE 情况的条件补偿;或者物化视图有 WHERE 且查询的 WHERE 条件是物化视图的超集。 - 目前暂时还不支持范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10,逐步支持。 \ No newline at end of file +- 目前 WHERE 条件补偿,只支持列为数值和日期类型的条件范围补偿,比如物化视图定义是 a > 5,查询是 a > 10支持透明改写。 + + +# 常见问题 +## 1. 物化视图没有命中是为什么? + 确定物化视图是否命中需要执行如下SQL + ```sql explain your_query_sql;``` + + a. 物化视图透明改写功能默认是关闭的,需要打开对应开关才可以改写,开关值见 异步物化视图相关开关 + + b. 可能物化视图不可用,导致透明改写不能命中,查看物化视图构建状态见问题2 + + c. 经过前两步的检查,如果物化视图还是不能命中,可能物化视图的定义SQL和查询SQL不在当前物化视图改写能力的范围内,见物化视图透明改写能力 + + +## 2. 怎么查看物化状态是否正常? +### 2.1 确认物化视图构建状态 +物化视图的状态是Success,才可以参与透明改写,首先运行 +```sql +select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` +查看物化视图的 JobName。 +其次根据JobName查看物化视图的任务状态,运行如下语句 +```sql +select * from tasks("type"="mv") where JobName = 'job_name'; +``` +查看最近执行的任务状态 `Status` 是否是 `Success` + +### 2.2 确认物化视图数据一致的可用性 +物化视图构建成功,但是因为数据变更,和 `grace_period` 的设置导致物化视图不可用。 +查看物化视图数据一致性的方法 +* 全量构建的物化视图 +运行如下sql,查看字段 `SyncWithBaseTables` 是否是 1 +```sql +select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` +* 分区构建的物化视图 +运行如下sql,查看查询使用的分区是否有效 +```sql +show partitions from mv_name; +``` + + +## 3. 构建物化时报错 + 报错信息 +` ERROR 1105 (HY000): errCode = 2, detailMessage = Syntax error in line 1: + BUILD IMMEDIATE REFRESH AUTO ON MANUAL` + +1. 异步物化视图的语句,在新优化器下才支持,确保使用的是新优化器 +`SET global enable_nereids_planner = true;` +2. 可能是构建物化的语句使用的关键词写错或者物化定义SQL的语法有问题,可以检查下物化定义SQL和创建物化语句是否正确。 + +## 4. 构建分区物化视图报 Unable to find a suitable base table for partitioning + 报这个错,通常指的是物化视图的SQL定义和物化视图分区字段的选择,导致不能分区增量更新,所以创建分区物化视图会报这个错。 + 物化视图想要分区增量更新,需要满足以下要求,详情见 [CREATE ASYNC MATERIALIZED VIEW](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md) + +满足分区物化视图构建,举例如下: + +```sql + CREATE TABLE IF NOT EXISTS lineitem ( + l_orderkey INTEGER NOT NULL, + l_partkey INTEGER NOT NULL, + l_suppkey INTEGER NOT NULL, + l_linenumber INTEGER NOT NULL, + l_quantity DECIMALV3(15,2) NOT NULL, + l_extendedprice DECIMALV3(15,2) NOT NULL, + l_discount DECIMALV3(15,2) NOT NULL, + l_tax DECIMALV3(15,2) NOT NULL, + l_returnflag CHAR(1) NOT NULL, + l_linestatus CHAR(1) NOT NULL, + l_shipdate DATE NOT NULL, + l_commitdate DATE NOT NULL, + l_receiptdate DATE NOT NULL, + l_shipinstruct CHAR(25) NOT NULL, + l_shipmode CHAR(10) NOT NULL, + l_comment VARCHAR(44) NOT NULL + ) + DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) + PARTITION BY RANGE(l_shipdate) ( + PARTITION `day_1` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_2` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_3` VALUES LESS THAN ("2023-12-30")) + DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +```sql + CREATE TABLE IF NOT EXISTS orders ( + o_orderkey INTEGER NOT NULL, + o_custkey INTEGER NOT NULL, + o_orderstatus CHAR(1) NOT NULL, + o_totalprice DECIMALV3(15,2) NOT NULL, + o_orderdate DATE NOT NULL, + o_orderpriority CHAR(15) NOT NULL, + o_clerk CHAR(15) NOT NULL, + o_shippriority INTEGER NOT NULL, + O_COMMENT VARCHAR(79) NOT NULL + ) + DUPLICATE KEY(o_orderkey, o_custkey) + PARTITION BY RANGE(o_orderdate) ( + PARTITION `day_2` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_3` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_4` VALUES LESS THAN ("2023-12-30") + ) + DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +物化视图定义如下,如果 `l_shipdate` 是基表 `lineitem` 的分区字段,如下的物化视图是可以进行分区增量更新的 + +```sql +CREATE MATERIALIZED VIEW mv9 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` +如下的物化视图是不可以进行分区增量更新的,因为 `l_shipdate` 来自 `LEFT OUTER JOIN` 的右侧 null 产生端。 + +```sql +CREATE MATERIALIZED VIEW mv10 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + + +## 5. 直查物化物化视图没有数据? +可能物化在构建中,也有可能物化构建失败了。 通过如下语句查看物化构建的状态 + +```sql + -- 查看物化视图元数据信息,database 为当前数据库, mv_name 为物化视图名称 + select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` + +```sql +-- 查看任务元数据 +select * from jobs("type"="mv") order by CreateTime limit 5; +``` +```sql +-- 查看任务执行信息,这里面会展示任务执行的状态,如果失败会有失败原因 +select * from tasks("type"="mv") where JobName = 'job_name'; +``` + +## 6. 物化视图使用的基表数据变了,但是此时物化视图还没有刷新,透明改写的行为是? +异步物化视图的数据时效性和基表是有一定时延的。 +对于内表和可以感知数据变化的外表(比如hive),当基表的数据变更时,此物化视图是否可用于透明改写是通过 `grace_period` 的阈值来决定的。 +`grace_period` 指的是容许物化视图和所用基表数据不一致的时间。 + +比如 grace_period 设置成 0,意味要求物化视图和基表数据保持一致,此物化视图才可用于透明改写; +对于外表(除 hive 外),因为无法感知数据变更,所以物化视图使用了外表,无论外表的数据是不是最新的,都可以使用此物化视图用于透明改写(此种情况数据会不一致)。 + +如果设置成 10,意味物化视图和基表数据允许 10s 的延迟,如果物化视图的数据和基表的数据有延迟,如果在 10s 内,此物化视图都可以用于透明改写。 + +如果物化视图是分区物化视图,如果部分分区失效。有如下两种情况 +1. 查询没有使用失效的分区数据,那么此物化视图依然可用于透明改写。 +2. 查询使用了失效分区的数据,并且数据时效在 `grace_period` 范围内,那么此物化视图依然可用。如果物化视图数据时效不在 `grace_period` 范围内。 +可以通过 union all 原表和物化视图来响应查询。 + +## 7. 怎么确认是否命中,如果不命中怎么查看原因? +可以通过 `explain query_sql` 的方式查看是否命中和不命中的摘要信息,例如如下物化视图 +```sql +CREATE MATERIALIZED VIEW mv11 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +查询如下 + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +Explain 显示信息可以看到 `MaterializedViewRewriteFail` 有失败的摘要信息, +`The graph logic between query and view is not consistent` 表示查询和物化join的逻辑不一致,上述查询和物化 join的表顺序不一致所以会报这个错。 +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, The graph logic between query and view is not consistent +``` + +来看另一个查询 +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` +Explain 显示信息如下 +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, View dimensions doesn't not cover the query dimensions +``` +失败的摘要信息为 `View dimensions doesn't not cover the query dimensions`,表示查询中 `group by` 的字段不能从物化 `group by` 中获取,会报这个错。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index f9f2ba2a896..694ba98371f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -93,147 +93,100 @@ BUILD IMMEDIATE ##### refreshMethod -用来定义物化视图刷新方式,默认AUTO +用来定义物化视图刷新方式,默认 AUTO COMPLETE:全量刷新 AUTO:尽量增量刷新,如果不能分区增量刷新,就全量刷新 -物化视图的 SQL 定义和分区字段需要满足如下条件才可以进行分区增量更新 +物化视图的 SQL 定义和分区字段需要满足如下条件,才可以进行分区增量更新: -- 物化视图使用的 base table 中至少有一个是分区表。 -- 物化视图使用的分区表,必须使用 list 或者 range 分区策略。 -- 物化视图最顶层的分区列只能有一个分区字段。 -- 物化视图的 SQL 中需要使用了 base table 中的分区列,比如在 Select 后。 -- 如果使用了group by,分区列的字段一定要在 group by 后。 -- 如果使用了 window 函数,分区列的字段一定要在partition by后。 +- 物化视图使用的 Base Table 中至少有一个是分区表。 +- 物化视图使用的 Base Table 分区表,必须使用 List 或者 Range 分区策略。 +- 物化视图定义 SQL 中 Partition By 分区列只能有一个分区字段。 +- 物化视图的 SQL 中 Partition By 的分区列,要在 Select 后。 +- 物化视图定义 SQL,如果使用了 Group By,分区列的字段一定要在 Group By 后。 +- 物化视图定义 SQL,如果使用了 Window 函数,分区列的字段一定要在 Partition By 后。 - 数据变更应发生在分区表上,如果发生在非分区表,物化视图需要全量构建。 -- 物化视图使用 Join 的 null 产生端的字段作为分区字段,不能分区增量更新。 -- 物化视图使用的 base table 表分区字段,如果来源于内表,base table 的 null 属性不能为空,如果来源于 hive 外表,base table 的 null 属性可以为空。 +- 物化视图使用 Join 的 NULL 产生端的字段作为分区字段,不能分区增量更新,例如对于 LEFT OUTER JOIN 分区字段需要在左侧,在右侧则不行。 -```sql -refreshMethod -: COMPLETE | AUTO -; -``` -例如:指定物化视图全量刷新 ```sql -CREATE MATERIALIZED VIEW mv1 -REFRESH COMPLETE -``` +refreshMethod + @@ -125,7 +125,7 @@ REFRESH COMPLETE ##### refreshTrigger -物化视图刷新数据的触发方式,默认MANUAL +物化视图刷新数据的触发方式,默认 MANUAL MANUAL:手动刷新 -SCHEDULE:定时刷新 - -```sql -refreshTrigger -: ON MANUAL -| ON SCHEDULE refreshSchedule -; - -refreshSchedule -: EVERY INTEGER_VALUE mvRefreshUnit (STARTS STRING_LITERAL)? -; - -mvRefreshUnit -: MINUTE | HOUR | DAY | WEEK -; -``` - -例如:每2小时执行一次,从2023-12-13 21:07:09开始 -```sql -CREATE MATERIALIZED VIEW mv1 -REFRESH ON SCHEDULE EVERY 2 HOUR STARTS "2023-12-13 21:07:09" + @@ -153,7 +153,7 @@ REFRESH ON SCHEDULE EVERY 2 HOUR STARTS "2023-12-13 21:07:09" ``` ##### key -物化视图为DUPLICATE KEY模型,因此指定的列为排序列 +物化视图为 Duplicate Key 模型,因此指定的列为排序列 ```sql identifierList -: LEFT_PAREN identifierSeq RIGHT_PAREN - ; - -identifierSeq -: ident+=errorCapturingIdentifier (COMMA ident+=errorCapturingIdentifier)* + @@ -165,36 +165,35 @@ identifierSeq ; ``` -例如:指定k1,k2为排序列 +例如:指定k1,k2为排序列 ```sql CREATE MATERIALIZED VIEW mv1 KEY(k1,k2) ``` ##### partition -物化视图有两种分区方式,如果不指定分区,默认只有一个分区,如果指定分区字段,会自动推导出字段来自哪个基表并同步基表(当前支持`OlapTable`和`hive`)的所有分区(限制条件:基表如果是`OlapTable`,那么只能有一个分区字段) +物化视图有两种分区方式,如果不指定分区,默认只有一个分区,如果指定分区字段,会自动推导出字段来自哪个基表并同步基表(当前支持 `OlapTable` 和 `hive`)的所有分区(限制条件:基表如果是 `OlapTable`,那么只能有一个分区字段)。 -例如:基表是range分区,分区字段为`create_time`并按天分区,创建物化视图时指定`partition by(ct) as select create_time as ct from t1` -那么物化视图也会是range分区,分区字段为`ct`,并且按天分区 +例如:基表是 Range 分区,分区字段为 `create_time` 并按天分区,创建物化视图时指定 `partition by(ct) as select create_time as ct from t1`,那么物化视图也会是 Range 分区,分区字段为 `ct`,并且按天分区。 -分区字段的选择和物化视图的定义需要满足分区增量更新的条件,物化视图才可以创建成功,否则会报错 `Unable to find a suitable base table for partitioning` +分区字段的选择和物化视图的定义需要满足分区增量更新的条件,物化视图才可以创建成功,否则会报错 `Unable to find a suitable base table for partitioning`。 #### property -物化视图既可以指定table的property,也可以指定物化视图特有的property。 +物化视图既可以指定 Table 的 property,也可以指定物化视图特有的 property。 -物化视图特有的property包括: +物化视图特有的 property 包括: -`grace_period`:查询改写时允许物化视图数据的最大延迟时间(单位:秒)。如果分区A和基表的数据不一致,物化视图的分区A上次刷新时间为1,系统当前时间为2,那么该分区不会被透明改写。但是如果grace_period大于等于1,该分区就会被用于透明改写 +`grace_period`:查询改写时允许物化视图数据的最大延迟时间(单位:秒)。如果分区 A 和基表的数据不一致,物化视图的分区 A 上次刷新时间为 1,系统当前时间为 2,那么该分区不会被透明改写。但是如果 `grace_period` 大于等于1,该分区就会被用于透明改写。 `excluded_trigger_tables`:数据刷新时忽略的表名,逗号分割。例如`table1,table2` -`refresh_partition_num`:单次insert语句刷新的分区数量,默认为1。物化视图刷新时会先计算要刷新的分区列表,然后根据该配置拆分成多个insert语句顺序执行。遇到失败的insert语句,整个任务将停止执行。物化视图保证单个insert语句的事务性,失败的insert语句不会影响到已经刷新成功的分区。 +`refresh_partition_num`:单次 insert 语句刷新的分区数量,默认为 1。物化视图刷新时会先计算要刷新的分区列表,然后根据该配置拆分成多个 insert 语句顺序执行。遇到失败的 insert 语句,整个任务将停止执行。物化视图保证单个 insert 语句的事务性,失败的 insert 语句不会影响到已经刷新成功的分区。 -`workload_group`:物化视图执行刷新任务时使用的workload_group名称。用来限制物化视图刷新数据使用的资源,避免影响到其它业务的运行。workload_group创建及使用 [WORKLOAD-GROUP](../../../../admin-manual/workload-group.md) +`workload_group`:物化视图执行刷新任务时使用的 `workload_group` 名称。用来限制物化视图刷新数据使用的资源,避免影响到其它业务的运行。关于 `workload_group` 的创建及使用,可参考 [WORKLOAD-GROUP](../../../../admin-manual/workload-group.md) 文档。 -`partition_sync_limit`:当基表的分区字段为时间时(如果是字符串类型的时间,可以设置partition_date_format),可以用此属性配置同步基表的分区范围,配合`partition_sync_time_unit`一起使用。 -例如设置为 2,`partition_sync_time_unit`设置为 MONTH,代表仅同步基表近2个月的分区和数据。最小值为1。 -随着时间的变化物化视图每次刷新时都会自动增删分区,例如物化视图现在有2,3两个月的数据,下个月的时候,会自动删除2月的数据,增加4月的数据。 +`partition_sync_limit`:当基表的分区字段为时间时(如果是字符串类型的时间,可以设置 `partition_date_format`),可以用此属性配置同步基表的分区范围,配合 `partition_sync_time_unit` 一起使用。 +例如设置为 2,`partition_sync_time_unit` 设置为 `MONTH`,代表仅同步基表近 2 个月的分区和数据。最小值为 `1`。 +随着时间的变化物化视图每次刷新时都会自动增删分区,例如物化视图现在有 2,3 两个月的数据,下个月的时候,会自动删除 2 月的数据,增加 4 月的数据。 `partition_sync_time_unit`:时间单位,支持 DAY/MONTH/YEAR(默认DAY) -`partition_date_format`:分区字段的时间格式,例如"%Y-%m-%d" - -##### query - -创建物化视图的查询语句,其结果即为物化视图中的数据 - -不支持随机函数,例如: -```sql + @@ -209,9 +208,9 @@ KEY(k1,k2) SELECT random() as dd,k3 FROM user ``` -### Example +### 示例 -1. 创建一个立即刷新,之后每周刷新一次的物化视图mv1,数据源为hive catalog +1. 创建一个立即刷新,之后每周刷新一次的物化视图 mv1,数据源为 hive catalog ```sql CREATE MATERIALIZED VIEW mv1 BUILD IMMEDIATE REFRESH COMPLETE ON SCHEDULE EVERY 1 WEEK - DISTRIBUTED BY RANDOM BUCKETS 2 - PROPERTIES ( - "replication_num" = "1" - ) + @@ -222,7 +221,7 @@ SELECT random() as dd,k3 FROM user AS SELECT * FROM hive_catalog.db1.user; ``` -2. 创建一个多表join的物化视图 +2. 创建一个多表 Join 的物化视图 ```sql CREATE MATERIALIZED VIEW mv1 BUILD IMMEDIATE REFRESH COMPLETE ON SCHEDULE EVERY 1 WEEK - DISTRIBUTED BY RANDOM BUCKETS 2 - PROPERTIES ( - "replication_num" = "1" - ) + @@ -233,7 +232,6 @@ SELECT random() as dd,k3 FROM user AS select user.k1,user.k3,com.k4 from user join com on user.k1=com.k1; ``` -### Keywords +### 关键词 CREATE, ASYNC, MATERIALIZED, VIEW - diff --git a/versioned_docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md b/versioned_docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md index 9514a5927c7..8f8382ad01c 100644 --- a/versioned_docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md +++ b/versioned_docs/version-2.1/query/view-materialized-view/query-async-materialized-view.md @@ -39,33 +39,33 @@ Using the three tables: lineitem, orders, and partsupp from TPC-H, let's describ a materialized view and using the materialized view for transparent query rewriting. ```sql CREATE TABLE IF NOT EXISTS lineitem ( - l_orderkey integer not null, - l_partkey integer not null, - l_suppkey integer not null, - l_linenumber integer not null, - l_quantity decimalv3(15,2) not null, - l_extendedprice decimalv3(15,2) not null, - l_discount decimalv3(15,2) not null, - l_tax decimalv3(15,2) not null, - l_returnflag char(1) not null, - l_linestatus char(1) not null, - l_shipdate date not null, - l_commitdate date not null, - l_receiptdate date not null, - l_shipinstruct char(25) not null, - l_shipmode char(10) not null, - l_comment varchar(44) not null - ) - DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) - PARTITION BY RANGE(l_shipdate) -(FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) - DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 - PROPERTIES ("replication_num" = "1"); + l_orderkey integer not null, + l_partkey integer not null, + l_suppkey integer not null, + l_linenumber integer not null, + l_quantity decimalv3(15,2) not null, + l_extendedprice decimalv3(15,2) not null, + l_discount decimalv3(15,2) not null, + l_tax decimalv3(15,2) not null, + l_returnflag char(1) not null, + l_linestatus char(1) not null, + l_shipdate date not null, + l_commitdate date not null, + l_receiptdate date not null, + l_shipinstruct char(25) not null, + l_shipmode char(10) not null, + l_comment varchar(44) not null + ) + DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) + PARTITION BY RANGE(l_shipdate) +(FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) + DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 + PROPERTIES ("replication_num" = "1"); insert into lineitem values - (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-17', '2023-10-17', '2023-10-17', 'a', 'b', 'yyyyyyyyy'), - (2, 4, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-18', '2023-10-18', '2023-10-18', 'a', 'b', 'yyyyyyyyy'), - (3, 2, 4, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-19', '2023-10-19', '2023-10-19', 'a', 'b', 'yyyyyyyyy'); + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-17', '2023-10-17', '2023-10-17', 'a', 'b', 'yyyyyyyyy'), + (2, 4, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-18', '2023-10-18', '2023-10-18', 'a', 'b', 'yyyyyyyyy'), + (3, 2, 4, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-19', '2023-10-19', '2023-10-19', 'a', 'b', 'yyyyyyyyy'); ``` ```sql CREATE TABLE IF NOT EXISTS orders ( @@ -81,7 +81,7 @@ CREATE TABLE IF NOT EXISTS orders ( ) DUPLICATE KEY(o_orderkey, o_custkey) PARTITION BY RANGE(o_orderdate)( - FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY) + FROM ('2023-10-17') TO ('2023-11-01') INTERVAL 1 DAY) DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 PROPERTIES ("replication_num" = "1"); @@ -356,6 +356,8 @@ Temporary support for the aggregation roll-up functions is as follows: | count(distinct ) | bitmap_union | bitmap_union_count | | bitmap_union | bitmap_union | bitmap_union | | bitmap_union_count | bitmap_union | bitmap_union_count | +| hll_union_agg, approx_count_distinct, hll_cardinality | hll_union 或者 hll_raw_agg | hll_union_agg | + ## Query partial Transparent Rewriting (Coming soon) When the number of tables in the materialized view is greater than the query, if the materialized view @@ -392,9 +394,8 @@ Query statement: LEFT OUTER JOIN orders ON L_ORDERKEY = O_ORDERKEY; ``` -## Union Rewriting (Coming soon) -When the materialized view is not sufficient to provide all the data for the query, it can use Union to return -data by combining the original table and the materialized view. +## Union Rewriting +When a materialized view is insufficient to provide all the data required by a query, a UNION ALL approach can be used to combine data from both the original table and the materialized view for the final result. Currently, the materialized view needs to be a partitioned materialized view, and UNION ALL can be used to supplement the data by applying the filter conditions on the partition fields. For example: **Case 1** @@ -403,44 +404,125 @@ Materialized view definition: ```sql CREATE MATERIALIZED VIEW mv7 -BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 1 hour -DISTRIBUTED BY RANDOM BUCKETS 3 -PROPERTIES ('replication_num' = '1') -AS -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 10; +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +as +select l_shipdate, o_orderdate, l_partkey, + l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` +When a new partition 2023-10-21 is added to the base table and the materialized view has not yet been refreshed, the result can be returned by combining the materialized view with the original table using UNION ALL. + +```sql +insert into lineitem values + (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-21', '2023-10-21', '2023-10-21', 'a', 'b', 'yyyyyyyyy'); +``` + + Query statement: ```sql -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5; +select l_shipdate, o_orderdate, l_partkey, l_suppkey, sum(o_totalprice) as sum_total +from lineitem + left join orders on lineitem.l_orderkey = orders.o_orderkey and l_shipdate = o_orderdate +group by + l_shipdate, + o_orderdate, + l_partkey, + l_suppkey; ``` Rewriting result: ```sql SELECT * -FROM mv -UNION ALL -SELECT - o_orderkey, - o_custkey, - o_orderstatus, - o_totalprice -FROM orders -WHERE o_orderkey > 5 AND o_orderkey <= 10; +FROM mv7 +union all +select t1.l_shipdate, o_orderdate, t1.l_partkey, t1.l_suppkey, sum(o_totalprice) as sum_total +from (select * from lineitem where l_shipdate = '2023-10-21') t1 + left join orders on t1.l_orderkey = orders.o_orderkey and t1.l_shipdate = o_orderdate +group by + t1.l_shipdate, + o_orderdate, + t1.l_partkey, + t1.l_suppkey; +``` + +Noted: +The materialized view includes a WHERE condition. For example, if the materialized view is constructed with the filter condition WHERE l_shipdate > '2023-10-19' and the query condition is WHERE l_shipdate > '2023-10-18', this situation currently cannot be compensated for using UNION ALL. This will be supported in the future. + +## Nested Materialized View Rewrite +The definition SQL of a materialized view can use another materialized view; this type of materialized view is called a nested materialized view. Theoretically, there is no limit to the number of nested layers. This materialized view can be queried directly or can participate in transparent rewrite operations. Nested materialized views can also be involved in transparent rewrites. + +**Case 1** + +Here is an example to illustrate how nested materialized views work: + +First, create the inner materialized view `mv8_0_inner_mv`. + +```sql +CREATE MATERIALIZED VIEW mv8_0_inner_mv +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey +from lineitem +inner join orders on lineitem.l_orderkey = orders.o_orderkey; +``` + +Create the outer materialized view `mv8_0`. + +```sql +CREATE MATERIALIZED VIEW mv8_0 +BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL +DISTRIBUTED BY RANDOM BUCKETS 2 +PROPERTIES ('replication_num' = '1') +AS +select +l_linenumber, +o_custkey, +o_orderkey, +o_orderstatus, +l_partkey, +l_suppkey, +l_orderkey, +ps_availqty +from mv8_0_inner_mv +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey; ``` +The following query will result in successful rewrite for both mv8_0_inner_mv and mv8_0. Ultimately, the cost-based optimizer will select mv8_0. + +```sql +select lineitem.l_linenumber +from lineitem +inner join orders on l_orderkey = o_orderkey +inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey +where o_orderstatus = 'o' +``` + +**Note:** + +1. The more layers of nested materialized views, the longer the time it will take for transparent rewriting. It is recommended not to exceed 3 layers of nested materialized views. +2. Transparent rewriting of nested materialized views is disabled by default. See the switch below for enabling it. + + + ## Auxiliary Functions **Data Consistency Issues After Transparent Rewriting** @@ -511,27 +593,290 @@ you can execute the following statement. It will provide a detailed breakdown of | SET enable_materialized_view_rewrite = true; | Enable or disable query transparent rewriting, default is disabled | | SET materialized_view_rewrite_enable_contain_external_table = true; | Whether materialized views participating in transparent rewriting are allowed to contain external tables, default is not allowed | | SET materialized_view_rewrite_success_candidate_num = 3; | Transparently rewrites the successful result set, allowing the maximum number of CBO candidates to participate, the default is 3 | - +| SET enable_materialized_view_union_rewrite = true; | Whether to allow the union of base table and materialized view using UNION ALL when the partitioned materialized view is insufficient to provide all the data required by a query. Default is enabled. | +| SET enable_materialized_view_nest_rewrite = true; | Whether to allow nested rewriting. Default is disabled. | +| SET materialized_view_relation_mapping_max_count = 8; | Maximum number of relation mappings allowed during transparent rewrite. If exceeded, truncation will occur. Relation mapping is typically generated by self-joins in tables and the number is usually the Cartesian product, for example, if there are 3 tables, it may generate 8 combinations. Default is 8. | ## Limitations -- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY clauses. - The input for JOIN can include simple GROUP BY (aggregation on a single table). - Supported types of JOIN operations include INNER and LEFT OUTER JOIN. - Support for other types of JOIN operations will be gradually added. +- Materialized view definition statements are only allowed to include SELECT, FROM, WHERE, JOIN, and GROUP BY clauses. +- The input to JOIN can include simple GROUP BY (single-table aggregation). Supported JOIN types include INNER, LEFT OUTER JOIN, +- RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, RIGHT SEMI JOIN, LEFT ANTI JOIN, and RIGHT ANTI JOIN. -- Materialized views based on External Tables do not guarantee strong consistency in query results. +- Materialized views based on External Tables do not guarantee strong consistency of query results. -- The use of non-deterministic functions to build materialized views is not supported, - including rand, now, current_time, current_date, random, uuid, etc. +- The use of non-deterministic functions to construct materialized views is not supported, including rand, now, current_time, +- current_date, random, uuid, etc. - Transparent rewriting does not support window functions. -- There is LIMIT in queries and materialized views, and transparent rewriting is not supported for the time being. +- Materialized views with LIMIT are currently not supported for transparent rewriting. - Currently, materialized view definitions cannot utilize views or other materialized views. - When the query or materialized view has no data, transparent rewriting is not supported. -- Currently, WHERE clause compensation supports scenarios where the materialized view does not have a WHERE clause, - but the query does, or where the materialized view has a WHERE clause and the query's WHERE clause is a superset - of the materialized view's. Range condition compensation is not yet supported but will be added gradually. +- Currently, WHERE condition compensation only supports compensating conditions on numeric and date type columns. + For example, if the materialized view is defined as a > 5 and the query is a > 10, transparent rewriting is supported. + +## Frequently Asked Questions + +### 1. Why isn't the materialized view being used? +To determine why a materialized view is not being used, execute the following SQL: + +`explain your_query_sql;` + +a. The transparent rewriting feature for materialized views is disabled by default. You need to enable the corresponding switch for it to work. See the related switches for asynchronous materialized views. + +b. The materialized view may not be available, causing transparent rewriting to fail. Check the status of materialized view construction, see problem 2. + +c. After checking the first two steps, if the materialized view is still not being used, it may be because the definition SQL of the materialized view and the query SQL are not within the current capability of the materialized view rewriting. See the capabilities of materialized view transparent rewriting. + +### 2. How to check if the materialized view status is normal? +#### 2.1 Confirm the Materialized View Construction Status +To participate in transparent rewriting, the status of the materialized view must be Success. First, run the following SQL to check the JobName of the materialized view: + +`select * from mv_infos('database'='db_name') where Name = 'mv_name'` + +Next, use the JobName to check the task status of the materialized view. Run the following SQL: + +`select * from tasks("type"="mv") where JobName = 'job_name';` + +Check if the status of the most recent task execution is Success. + +#### 2.2 Confirm the Availability of Consistent Materialized View Data +If the materialized view is successfully built but is unavailable due to data changes and the `grace_period` setting, +confirm the availability of consistent materialized view data. + +**For Full Refresh Materialized Views:** + +Run the following SQL and check if the `SyncWithBaseTables` field is 1: + +`select * from mv_infos('database'='db_name') where Name = 'mv_name'` + +**For Partitioned Materialized Views:** + +Run the following SQL to check if the partitions used in the query are valid: +`show partitions from mv_name;` + +### 3. Error During Materialized View Construction +Error Message: + +`ERROR 1105 (HY000): errCode = 2, detailMessage = Syntax error in line 1: +BUILD IMMEDIATE REFRESH AUTO ON MANUAL` + +a. The statement for asynchronous materialized views is only supported under the new optimizer. +Ensure that the new optimizer is enabled: + +`SET global enable_nereids_planner = true;` + +b. It's possible that there is a typo in the keywords used in the statement to build the materialized view, or there +may be syntax errors in the materialized view definition SQL. Check the materialized view definition SQL and the create +materialized view statement to ensure correctness. + +### 4. Error: Unable to Find a Suitable Base Table for Partitioning +This error typically indicates that the SQL definition of the materialized view and the selection of partitioning fields +for the materialized view are not suitable for partitioned incremental updates. Therefore, creating a partitioned +materialized view will result in this error. + +For a materialized view to support partitioned incremental updates, it needs to meet certain requirements. +For more details, refer to the [CREATE ASYNC MATERIALIZED VIEW documentation](../../sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md). + +Here is an example to illustrate the construction of a partitioned materialized view: + +```sql + CREATE TABLE IF NOT EXISTS lineitem ( + l_orderkey INTEGER NOT NULL, + l_partkey INTEGER NOT NULL, + l_suppkey INTEGER NOT NULL, + l_linenumber INTEGER NOT NULL, + l_quantity DECIMALV3(15,2) NOT NULL, + l_extendedprice DECIMALV3(15,2) NOT NULL, + l_discount DECIMALV3(15,2) NOT NULL, + l_tax DECIMALV3(15,2) NOT NULL, + l_returnflag CHAR(1) NOT NULL, + l_linestatus CHAR(1) NOT NULL, + l_shipdate DATE NOT NULL, + l_commitdate DATE NOT NULL, + l_receiptdate DATE NOT NULL, + l_shipinstruct CHAR(25) NOT NULL, + l_shipmode CHAR(10) NOT NULL, + l_comment VARCHAR(44) NOT NULL + ) + DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey, l_linenumber) + PARTITION BY RANGE(l_shipdate) ( + PARTITION `day_1` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_2` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_3` VALUES LESS THAN ("2023-12-30")) + DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` + +```sql + CREATE TABLE IF NOT EXISTS orders ( + o_orderkey INTEGER NOT NULL, + o_custkey INTEGER NOT NULL, + o_orderstatus CHAR(1) NOT NULL, + o_totalprice DECIMALV3(15,2) NOT NULL, + o_orderdate DATE NOT NULL, + o_orderpriority CHAR(15) NOT NULL, + o_clerk CHAR(15) NOT NULL, + o_shippriority INTEGER NOT NULL, + O_COMMENT VARCHAR(79) NOT NULL + ) + DUPLICATE KEY(o_orderkey, o_custkey) + PARTITION BY RANGE(o_orderdate) ( + PARTITION `day_2` VALUES LESS THAN ('2023-12-9'), + PARTITION `day_3` VALUES LESS THAN ("2023-12-11"), + PARTITION `day_4` VALUES LESS THAN ("2023-12-30") + ) + DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3 + PROPERTIES ( + "replication_num" = "1" + ); +``` +If l_shipdate is the partition field of the base table lineitem, the following materialized view can be incrementally +updated by partition. + +```sql +CREATE MATERIALIZED VIEW mv9 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +The following materialized view cannot be incrementally updated by partition because l_shipdate is generated from the +right side of a LEFT OUTER JOIN and may produce null values. + +```sql +CREATE MATERIALIZED VIEW mv10 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, + count(O_ORDERDATE) over (partition by l_shipdate order by l_orderkey) as window_count +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +### 5. The materialized view returns no data upon direct query? +The materialized view might still be under construction or the construction process might have failed. Use the following +statement to check the status of materialized view construction: + +```sql + -- 查看物化视图元数据信息,database 为当前数据库, mv_name 为物化视图名称 + select * from mv_infos('database'='db_name') where Name = 'mv_name' \G +``` + +```sql +-- 查看任务元数据 +select * from jobs("type"="mv") order by CreateTime limit 5; +``` +```sql +-- 查看任务执行信息,这里面会展示任务执行的状态,如果失败会有失败原因 +select * from tasks("type"="mv") where JobName = 'job_name'; +``` + +### 6. What happens when the data in the base tables used by the materialized view changes before the materialized view is refreshed? +The timeliness of data in asynchronous materialized views has a certain delay compared to the base tables. + +For internal tables and external tables that can perceive data changes (such as Hive tables), +whether a materialized view can be used for transparent rewriting when the data in the base tables changes +before the materialized view is refreshed depends on the threshold set by `grace_period`. + +`grace_period` refers to the time allowance for the materialized view to be inconsistent with the data from the base tables. + +For example, if `grace_period` is set to 0, it means that the materialized view must be consistent with the data in the base +tables before it can be used for transparent rewriting. + +For external tables (except Hive tables), since data changes cannot be perceived, the materialized view can be used for transparent rewriting regardless of whether the data in the external tables is up-to-date (in this case, data inconsistency may occur). + +If `grace_period` is set to 10, it means that the materialized view and the base table data are allowed to have a delay of 10 seconds. If there is a delay in the data of the materialized view compared to the base tables within 10 seconds, the materialized view can still be used for transparent rewriting. + +For partitioned materialized views, if some partitions become invalid, there are two scenarios: + +a. If the query does not use data from the invalid partitions, the materialized view can still be used for transparent rewriting. + +b. If the query uses data from the invalid partitions and the data is within the grace_period, the materialized view can still be used. If the data in the materialized view is not within the grace_period, the query can be responded to by using UNION ALL with the original table and the materialized view. + +### 7. How to confirm if the materialized view is hit, and how to check the reason if it's not hit? + +You can use the `explain query_sql` command to see a summary of whether the materialized view is hit or not. For example, +consider the following materialized view: + +```sql +CREATE MATERIALIZED VIEW mv11 +BUILD IMMEDIATE REFRESH AUTO ON MANUAL +partition by(l_shipdate) +DISTRIBUTED BY HASH(l_orderkey) BUCKETS 10 +PROPERTIES ('replication_num' = '1') +AS +SELECT l_shipdate, l_orderkey, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_orderkey, O_ORDERDATE; +``` + +Now, let's analyze the following query: + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM orders +LEFT OUTER JOIN lineitem on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +The explain command will show information about whether the materialized view is hit or not. +If it's not hit, it will provide a summary of the failure reason. For example: +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, The graph logic between query and view is not consistent +``` + +In this case, the failure reason is `The graph logic between query and view is not consistent`, +which means that the join order in the query is not consistent with the join order in the materialized view. + +Let's consider another query: + +```sql +explain +SELECT l_shipdate, l_linestatus, O_ORDERDATE, count(*) +FROM lineitem +LEFT OUTER JOIN orders on l_orderkey = o_orderkey +GROUP BY l_shipdate, l_linestatus, O_ORDERDATE; +``` + +If this query also fails to hit the materialized view, the summary might be: + +```text +| MaterializedView | +| MaterializedViewRewriteSuccessAndChose: | +| | +| MaterializedViewRewriteSuccessButNotChose: | +| | +| MaterializedViewRewriteFail: | +| Name: internal#doc_test#mv11 | +| FailSummary: View struct info is invalid, View dimensions doesn't not cover the query dimensions +``` + +In this case, the failure reason is `View dimensions doesn't not cover the query dimensions`, indicating that +the fields used in the `GROUP BY` clause of the query cannot be obtained from the `GROUP BY` clause +of the materialized view. \ No newline at end of file diff --git a/versioned_docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/versioned_docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index f1fd88afd57..fd93816e595 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/versioned_docs/version-2.1/sql-manual/sql-statements/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -101,15 +101,15 @@ AUTO:Try to refresh incrementally as much as possible. If incremental refresh The SQL definition and partition fields of the materialized view need to meet the following conditions for partition incremental updates: -- At least one of the base tables used by the materialized view must be a partitioned table. -- The partitioned tables used by the materialized view must employ list or range partitioning strategies. -- The top-level partition column in the materialized view can only have one partition field. -- The SQL of the materialized view needs to use partition columns from the base table, such as after the SELECT clause. -- If GROUP BY is used, the partition column fields must be after the GROUP BY. -- If window functions are used, the partition column fields must be after the PARTITION BY. -- Data changes should occur on partitioned tables. If they occur on non-partitioned tables, the materialized view needs to be fully rebuilt. -- Using the fields that generate nulls in the JOIN as partition fields in the materialized view prohibits partition incremental updates. -- The null attribute of partition fields in the base table used by the materialized view must not be empty if it comes from an internal table. If it comes from an external table in Hive, the null attribute of the base table can be empty. +- At least one of the base tables used by the materialized view is a partitioned table. +- The base tables used by the materialized view must use list or range partitioning strategies. +- The `partition by` clause in the SQL definition of the materialized view can only have one partitioning column. +- The partitioning column specified in the `partition by` clause of the materialized view's SQL must come after the `SELECT` statement. +- If the materialized view's SQL includes a `group by` clause, the columns used for partitioning must come after the `group by` clause. +- If the materialized view's SQL includes window functions, the columns used for partitioning must come after the `partition by` clause. +- Data changes should occur on partitioned tables. If data changes occur on non-partitioned tables, the materialized view needs to be fully rebuilt. +- If a field from the null-generating side of a join is used as a partitioning column for the materialized view, it cannot be incrementally updated by partition. For example, for a LEFT OUTER JOIN, the partitioning column must be on the left side, not the right. + ```sql refreshMethod --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org