This is an automated email from the ASF dual-hosted git repository. lihaopeng pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 47f3aca6e33 [Doc](exec) change the doc by code review (#1680) 47f3aca6e33 is described below commit 47f3aca6e33ce368c18685a4eb3c88a2342cbf37 Author: HappenLee <happen...@selectdb.com> AuthorDate: Mon Dec 30 23:30:24 2024 +0800 [Doc](exec) change the doc by code review (#1680) change the doc of [parallelism-tuning.md](https://github.com/apache/doris-website/compare/docs?expand=1#diff-0e61462128c042c34ffd872e523198a79dce1e7ba8680404326360dbe344d1c9) ## Versions - [x] dev - [ ] 3.0 - [ ] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [x] Checked by AI - [ ] Test Cases Built --- .../tuning/parallelism-tuning.md | 77 ------------ .../tuning-execution/parallelism-adjustment.md | 96 --------------- .../tuning/tuning-execution/parallelism-tuning.md | 118 +++++++++++++++++++ .../tuning-execution/parallelism-adjustment.md | 98 ---------------- .../tuning/tuning-execution/parallelism-tuning.md | 130 +++++++++++++++++++++ sidebars.json | 5 +- 6 files changed, 250 insertions(+), 274 deletions(-) diff --git a/docs/query-acceleration/tuning/parallelism-tuning.md b/docs/query-acceleration/tuning/parallelism-tuning.md deleted file mode 100644 index 1c7d9153eab..00000000000 --- a/docs/query-acceleration/tuning/parallelism-tuning.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -{ -"title": "Parallelism Tuning", -"language": "en" -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# Parallelism Tuning - -Doris's query is an MPP execution framework, and each query will be executed in parallel on multiple BEs. At the same time, within a single BE, a multi-threaded parallel approach is also adopted to accelerate the query execution efficiency. Currently, all statements (including Query, DML, DDL) support parallel execution. - -The control parameter for the parallelism within a single BE is: parallel_pipeline_task_num, which refers to the number of working tasks used when a single Fragment is executed. - -## Principles of Parallelism Tuning - -The purpose of setting `parallel_pipeline_task_num` is to fully utilize multi-core resources and reduce query latency. However, in order to achieve multi-core parallel execution, some data Shuffle operators and multi-thread synchronization logic are usually introduced, which also brings some unnecessary resource waste. - -The default value in Doris is 0, which is half of the number of CPU cores of the BE. This value takes into account the resource utilization of single queries and concurrent situations, and usually does not require user intervention for adjustment. When there is a performance bottleneck, the following examples can be referred to for necessary adjustments. Doris is continuously improving the adaptive strategy, and it is usually recommended to make necessary adjustments in specific scenario [...] - -### Examples - -Suppose the number of CPU cores of the BE is 16: - -1. For simple operations on a single table (such as single-table point query, where scan to obtain a small amount of data, limit a small amount of data, hit a materialized view), **the parallelism can be set to 1**. - Explanation: For simple operations on a single table, there is only one Fragment. The bottleneck of the query is usually in the data scanning and processing. The data scanning thread and the query execution thread are separated. The data scanning thread will adaptively perform parallel scanning. The bottleneck here is not the query thread, so the parallelism can be directly set to 1. -2. For queries with two-table JOIN/aggregate queries, if the data volume is very large and it is confirmed to be a CPU bottleneck query, **the parallelism can be set to 16**. - Explanation: For two-table JOIN/aggregate queries, which are data computation-intensive queries, if it is observed that the CPU is not fully utilized, consider further increasing the parallelism on the basis of the default value, using the parallel ability of the Pipeline execution engine to fully utilize CPU resources for computation. It cannot be guaranteed that each PipelineTask can use the CPU resources allocated to it to the limit. Therefore, the parallelism can be appropriately [...] -3. For the stress testing scenario, if the multiple query tasks in the stress test can fully utilize the CPU, consider **setting the parallelism to 1**. - Explanation: For the stress testing scenario, there are enough query tasks in the stress test. Excessive parallelism also brings thread scheduling overhead and framework scheduling overhead. It is more reasonable to set it to 1 here. -4. For complex queries, it is necessary to flexibly adjust according to the Profile and machine load. Here, it is recommended to use the default value. If it is not appropriate, you can try to adjust it in a stepwise manner of 4-2-1 and observe the query performance and machine load. - -## Methods of Parallelism Tuning - -Doris can manually specify the parallelism of a query to adjust the parallel execution efficiency when the query is executed. - -### SQL Level Adjustment: - -Use SQL HINT to specify the parallelism of a single SQL, so that the parallelism of different SQLs can be flexibly controlled to achieve the best execution effect. - -```SQL -select /*SET_VAR("parallel_pipeline_task_num=8")*/ * from nation, lineitem where lineitem.l_suppkey = nation.n_nationkey -``` - -### Session Level Adjustment: - -Adjust the parallelism at the session level through session variables. All query statements in the session will be executed with the specified parallelism. Please note that even a single-line query SQL will use this parallelism, which may lead to performance degradation. - -```SQL -set parallel_pipeline_task_num = 8; -``` - -### Global Adjustment: - -If global adjustment is required, usually involving the adjustment of CPU utilization, the parallelism can be set globally. - -```SQL -set global parallel_pipeline_task_num = 8; -``` diff --git a/docs/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md b/docs/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md deleted file mode 100644 index 2555c0bc1be..00000000000 --- a/docs/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md +++ /dev/null @@ -1,96 +0,0 @@ ---- -{ -"title": "Parallelism Adjustment", -"language": "en" -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# Parallelism Adjustment - -## Overview - -In actual production scenarios, performance issues often occur due to unreasonable parallelism settings. The following cases list examples of optimization by adjusting parallelism. - -## Case 1: High Parallelism Leads to High CPU Usage in High Concurrency Pressure Scenarios - -When high CPU usage is observed online, affecting the performance of some low-latency queries, consider reducing CPU usage by adjusting the query parallelism. Since Doris's design concept is to prioritize using more resources to obtain query results as quickly as possible, in some scenarios with tight online resources, this may lead to poor performance. Therefore, appropriate adjustment of parallelism can improve the overall stability and efficiency of queries under limited resources. -Set the parallelism from the default 0 (half of the number of CPU cores) to 4: - -```SQL -set global parallel_pipeline_task_num = 4; -``` - -Since this parameter takes effect at the session level, consider restarting the FE to make the setting globally effective if necessary. -After the adjustment, the CPU usage was reduced to 60% of the previous peak value, reducing the impact on some low-latency queries. - -## Case 2: Increase Parallelism to Further Utilize CPU to Accelerate Queries - -The current default parallelism in Doris is half of the number of CPU cores, and some compute-intensive scenarios cannot fully utilize the CPU for query acceleration. - -```SQL -select sum(if(t2.value is null, 0, 1)) exist_value, sum(if(t2.value is null, 1, 0)) no_exist_value -from t1 left join t2 on t1.key = t2.key; -``` - -In a scenario with 2 billion rows in the left table and 5 million rows in the right table, the above SQL takes 28 seconds to execute. Observe the Profile: - -```SQL - HASH_JOIN_OPERATOR (id=3, nereids_id=448): - - PlanInfo - - join op: LEFT OUTER JOIN(BROADCAST)[] - - equal join conjunct: (value = value) - - cardinality=2,462,330,332 - - vec output tuple id: 5 - - output tuple id: 5 - - vIntermediate tuple ids: 4 - - hash output slot ids: 16 - - projections: value - - project output tuple id: 5 - - BlocksProduced: sum 360.099K (360099), avg 45.012K (45012), max 45.014K (45014), min 45.011K (45011) - - CloseTime: avg 8.44us, max 13.327us, min 5.574us - - ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms - - InitTime: avg 7.122us, max 13.395us, min 4.541us - - MemoryUsage: sum, avg, max, min - - PeakMemoryUsage: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB - - ProbeKeyArena: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB - - OpenTime: avg 2.967us, max 4.120us, min 1.562us - - ProbeRows: sum 1.4662330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) - - ProjectionTime: avg 165.392ms, max 169.762ms, min 161.727ms - - RowsProduced: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) -``` - -The main time-consuming part here: `ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms` all occurs in the Join operator, and the total amount of data processed: `ProbeRows: sum 1.4662330332B` is 1.4 billion. This is a typical CPU-intensive computing situation. By observing the machine monitoring, it is found that the CPU resources are not fully utilized, and the CPU utilization is 60%. At this time, consider increasing the parallelism to further utilize the idle CPU resources for ac [...] - -Set the parallelism as follows: - -```SQL -set parallel_pipeline_task_num = 16; -``` - -The query time was reduced from 28 seconds to 19 seconds, and the CPU utilization increased from 60% to 90%. - -## Summary - -Usually, users do not need to intervene in adjusting the query parallelism. If adjustment is required, the following points should be noted: - -1. It is recommended to start from CPU utilization. Observe whether it is a CPU bottleneck through the output of the PROFILE tool and try to make reasonable modifications to the parallelism. -2. Adjusting a single SQL is relatively safe. Try not to make overly aggressive global modifications. \ No newline at end of file diff --git a/docs/query-acceleration/tuning/tuning-execution/parallelism-tuning.md b/docs/query-acceleration/tuning/tuning-execution/parallelism-tuning.md new file mode 100644 index 00000000000..4aed179f8a8 --- /dev/null +++ b/docs/query-acceleration/tuning/tuning-execution/parallelism-tuning.md @@ -0,0 +1,118 @@ +# Parallelism Tuning + +## Overview + +Doris queries are executed in an MPP (Massively Parallel Processing) framework, where each query is executed in parallel across multiple BEs (Backend Executors). Meanwhile, within a single BE, a multi-threaded parallel approach is adopted to enhance query execution efficiency. Currently, all types of statements, including queries, DML (Data Manipulation Language), and DDL (Data Definition Language), support parallel execution. + +The control parameter for parallelism within a single BE is `parallel_pipeline_task_num`, which refers to the number of working tasks used by a single Fragment during execution. In actual production scenarios, performance issues may arise due to improper parallelism settings. The following examples illustrate cases of optimizing parallelism. + +## Principles of Parallelism Tuning + +The purpose of setting `parallel_pipeline_task_num` is to fully utilize multi-core resources and reduce query latency. However, to enable multi-core parallel execution, some data shuffle operators and synchronization logic between multiple threads are usually introduced, which may also lead to unnecessary resource wastage. + +The default value in Doris is 0, which is half the number of CPU cores of the BE. This value takes into account the resource utilization of both single queries and concurrent operations, and usually does not require user intervention for adjustment. When there is a performance bottleneck, refer to the following examples for necessary adjustments. Doris is continuously improving its adaptive strategy, and it is usually recommended to make necessary adjustments in specific scenarios or at [...] + +### Examples + +Suppose the BE has 16 CPU cores: + +1. For simple operations on a single table (such as single-table point queries, `WHERE` clause scans to retrieve a small amount of data, `LIMIT` a small amount of data, or hitting a materialized view), **the parallelism can be set to 1**. + Explanation: Simple operations on a single table involve only one Fragment. The bottleneck of such queries usually lies in data scanning and processing. The data scanning thread and the query execution thread are separated, and the data scanning thread will perform parallel scanning adaptively. Here, the bottleneck is not the query thread, so the parallelism can be directly set to 1. +2. For queries involving two-table `JOIN` or aggregation queries, if the data volume is large and it is confirmed to be a CPU-bound query, **the parallelism can be set to 16**. + Explanation: For two-table `JOIN` or aggregation queries, which are data computation-intensive queries, if the CPU is not fully utilized, consider increasing the parallelism on the basis of the default value to take advantage of the parallel capabilities of the Pipeline execution engine and fully utilize CPU resources for computation. It cannot be guaranteed that each PipelineTask can utilize the allocated CPU resources to the fullest. Therefore, the parallelism can be adjusted approp [...] +3. In a stress testing scenario, where the multiple queries in the stress test can fully utilize the CPU, **the parallelism can be set to 1**. + Explanation: In a stress testing scenario, there are sufficient query tasks. Excessive parallelism also brings thread scheduling overhead and framework scheduling overhead. Setting it to 1 is more reasonable in this case. +4. For complex queries, the parallelism should be adjusted flexibly based on the Profile and machine load. Here, it is recommended to use the default value. If it is not suitable, a stepwise adjustment of 4-2-1 can be tried, and the query performance and machine load should be observed. + +## Methods of Parallelism Tuning + +Doris allows users to manually specify the parallelism of a query to adjust the parallel execution efficiency during query execution. + +### SQL Level Adjustment: + +Use SQL HINT to specify the parallelism of a single SQL statement. This allows for flexible control of the parallelism of different SQL statements to achieve the best execution results. + +```sql +select /*SET_VAR("parallel_pipeline_task_num=8")*/ * from nation, lineitem where lineitem.l_suppkey = nation.n_nationkey +``` + +### Session Level Adjustment: + +Adjust the parallelism at the session level through session variables. All query statements in the session will be executed with the specified parallelism. Please note that even single-line SQL queries will use this parallelism, which may lead to performance degradation. + +```SQL +set parallel_pipeline_task_num = 8; +``` + +### Global Adjustment: + +If global adjustment is required, usually involving CPU utilization adjustment, the parallelism can be set globally. + +```SQL +set global parallel_pipeline_task_num = 8; +``` + +## Case 1: High Parallelism Leading to High CPU Usage in a High-Concurrency Pressure Scenario + +When observing high CPU usage online, which affects the performance of some low-latency queries, consider adjusting the query parallelism to reduce CPU usage. Since Doris's design philosophy is to prioritize using more resources to obtain query results as quickly as possible, in some scenarios with tight online resources, this may lead to poor performance. Therefore, appropriate adjustment of parallelism can improve the overall stability and efficiency of queries under limited resources. + +Set the parallelism from the default value of 0 (half the number of CPU cores) to 4: + +```SQL +set global parallel_pipeline_task_num = 4; +``` + +After the global setting, it takes effect for the current connection and new connections. Existing other connections are not affected. If immediate global effect is required, the FE (Frontend) can be restarted. After the adjustment, the CPU usage is reduced to 60% of the previous peak value, reducing the impact on some low-latency queries. + +## Case 2: Increasing Parallelism to Further Utilize the CPU for Query Acceleration + +The current default parallelism in Doris is half the number of CPU cores, and some computation-intensive scenarios cannot fully utilize the CPU for query acceleration. + +```SQL +select sum(if(t2.value is null, 0, 1)) exist_value, sum(if(t2.value is null, 1, 0)) no_exist_value +from t1 left join t2 on t1.key = t2.key; +``` + +In a scenario with 2 billion rows in the left table and 5 million rows in the right table, the above SQL takes 28 seconds to execute. Observe the Profile: + +```SQL +HASH_JOIN_OPERATOR (id=3, nereids_id=448): + - PlanInfo + - join op: LEFT OUTER JOIN(BROADCAST)[] + - equal join conjunct: (value = value) + - cardinality=2,462,330,332 + - vec output tuple id: 5 + - output tuple id: 5 + - vIntermediate tuple ids: 4 + - hash output slot ids: 16 + - projections: value + - project output tuple id: 5 + - BlocksProduced: sum 360.099K (360099), avg 45.012K (45012), max 45.014K (45014), min 45.011K (45011) + - CloseTime: avg 8.44us, max 13.327us, min 5.574us + - ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms + - InitTime: avg 7.122us, max 13.395us, min 4.541us + - MemoryUsage: sum, avg, max, min + - PeakMemoryUsage: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB + - ProbeKeyArena: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB + - OpenTime: avg 2.967us, max 4.120us, min 1.562us + - ProbeRows: sum 1.4662330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) + - ProjectionTime: avg 165.392ms, max 169.762ms, min 161.727ms + - RowsProduced: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) +``` + +The main time-consuming part here: `ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms` all occurs in the Join operator, and the total amount of data processed: `ProbeRows: sum 1.4662330332B` is 1.4 billion, which is a typical CPU-intensive computation scenario. Observing the machine monitoring, it is found that the CPU resources are not fully utilized, with a CPU utilization rate of 60%. At this time, consider increasing the parallelism to further utilize the idle CPU resources for [...] + +Set the parallelism as follows: + +```SQL +set parallel_pipeline_task_num = 16; +``` + +The query execution time is reduced from 28 seconds to 19 seconds, and the CPU utilization rate is increased from 60% to 90%. + +## Summary + +Usually, users do not need to adjust the query parallelism. If adjustment is required, the following points should be noted: + +1. It is recommended to start from the CPU utilization. Observe whether it is a CPU bottleneck through the PROFILE tool output and try to make reasonable modifications to the parallelism. +2. Adjusting a single SQL is relatively safe. Try not to make overly aggressive global modifications. \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md deleted file mode 100644 index e37a6f44eaa..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-adjustment.md +++ /dev/null @@ -1,98 +0,0 @@ ---- -{ -"title": "并行度调整", -"language": "zh-CN" -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# 并行度调整 - -## 概述 - -实际生产场景经常会遇到并行度设置不合理,引起的性能问题。在以下的案例中,列举了调整并行度优化的案例。 - -## 案例 1:并行度过高导致高并发压力场景,CPU 使用率过高 - -当线上观察到 CPU 使用率过高,影响到部分低时延查询的性能时,可以考虑通过调整查询并行度来降低 CPU 使用率。由于 Doris 的设计理念是优先使用更多资源以最快速度获取查询结果,在某些线上资源紧张的场景下,可能会导致性能表现不佳。因此,适当调整并行度可以在资源有限的情况下提升查询的整体稳定性和效率。 - -设置并行度从默认的 0(CPU 核数的一半)到 4: - -```SQL -set global parallel_pipeline_task_num = 4; -``` - -由于该参数是 Session 生效,必要时考虑重启 FE 让该设置全局生效。 - -调整之后,CPU 使用率降低到原先高峰值的 60%,降低了部分时延较低的查询的影响。 - -## 案例 2:调高并行度,进一步利用 CPU 加速查询 - -当前 Doris 默认的并行度为 CPU 核数的一半,部分计算密集型的场景并不能充分利用满 CPU 进行查询加速, - -```SQL -select sum(if(t2.value is null, 0, 1)) exist_value, sum(if(t2.value is null, 1, 0)) no_exist_value -from t1 left join t2 on t1.key = t2.key; -``` - -在左表 20 亿,右表 500 万的场景上,上述 SQL 需要执行 28s。观察 Profile: - -```SQL - HASH_JOIN_OPERATOR (id=3 , nereids_id=448): - - PlanInfo - - join op: LEFT OUTER JOIN(BROADCAST)[] - - equal join conjunct: (value = value) - - cardinality=2,462,330,332 - - vec output tuple id: 5 - - output tuple id: 5 - - vIntermediate tuple ids: 4 - - hash output slot ids: 16 - - projections: value - - project output tuple id: 5 - - BlocksProduced: sum 360.099K (360099), avg 45.012K (45012), max 45.014K (45014), min 45.011K (45011) - - CloseTime: avg 8.44us, max 13.327us, min 5.574us - - ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms - - InitTime: avg 7.122us, max 13.395us, min 4.541us - - MemoryUsage: sum , avg , max , min - - PeakMemoryUsage: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB - - ProbeKeyArena: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB - - OpenTime: avg 2.967us, max 4.120us, min 1.562us - - ProbeRows: sum 1.4662330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) - - ProjectionTime: avg 165.392ms, max 169.762ms, min 161.727ms - - RowsProduced: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) -``` - -这里主要的时间耗时:`ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms`都发生在 Join 算子上,同时处理的数据总量:`ProbeRows: sum 1.4662330332B`有14亿,这是一个典型的 CPU 密集的运算情况。观察机器监控,发现 CPU 资源没有打满,CPU 利用率为 60%,此时可以考虑调高并行度来进一步利用空闲的 CPU 资源进行加速。 - -设置并行度如下: - -```SQL -set parallel_pipeline_task_num = 16; -``` - -查询耗时从 28s 降低到 19s,cpu 利用率从 60%上升到 90%。 - -## 总结 - -通常用户不需要介入调整查询并行度,如需要调整,需要注意以下事项: - -1. 建议从 CPU 利用率出发。通过 PROFILE 工具输出观察是否是 CPU 瓶颈,尝试进行并行度的合理修改 -2. 单 SQL 调整比较安全,尽量不要全局做过于激进的修改 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-tuning.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-tuning.md new file mode 100644 index 00000000000..433319ae6ef --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/tuning/tuning-execution/parallelism-tuning.md @@ -0,0 +1,130 @@ +# 并行度调优 + +## 概述 + +Doris 的查询是一个MPP的执行框架,每一条查询都会在多个BE上并行执行;同时,在单个BE内部也会采用多线程并行的方式来加速查询的执行效率,目前所有的语句(包括Query,DML,DDL)均支持并行执行。 + +单个BE内并行度的控制参数是:parallel_pipeline_task_num,是指单个Fragment在执行时所使用的工作任务数。在实际生产场景会遇到并行度设置不合理,引起的性能问题。在以下的案例中,列举了调整并行度优化的案例。 + + + +## **并行度调优的原则** + +parallel_pipeline_task_num设定目的是为了充分利用多核资源,降低查询的延迟;但是,为了多核并行执行,通常会引入一些数据Shuffle算子,以及多线程之间同步的逻辑,这也会带来一些不必要的资源浪费。 + +Doris中默认值为0,即BE的CPU核数目的一半,这个值考虑了单查询和并发的资源利用的情况,通常不需要用户介入调整。当存在性能瓶颈时可以参考下面示例进行必要的调整。Doris在持续完善自适应的策略,通常建议在特定场景或SQL级别进行必要的调整。 + +### **示例** + +假设BE的CPU核数为16: + + 1.对于单表的简单操作(如单表点差、where扫描获取少量数据,limit少量数据,命中物化视图) **并行度可设置为1** + +说明:单表的简单操作,只有一个Fragment,查询的瓶颈通常在数据扫描处理上,数据扫描线程和查询执行的线程是分开的,数据扫描线程会自适应的做并行的扫描,这里的瓶颈不是查询线程,并行度可以直接设置为1。 + + 2.对于两表 `JOIN` 的查询/聚合查询,如果数据量很大,确认是CPU瓶颈型查询,**并行度可设置为16**。 + +说明:对于两表 `JOIN`/聚合查询,这类数据计算密集型的查询,如果观察CPU没有打满,可以考虑在默认值的基础上,继续调大并行度,利用Pipeline执行引擎的并行能力,充分利用CPU资源参与计算。并不能保证每个PipelineTask都能将分配给它的CPU资源使用到极限。因此,可以适当调整并行度,比如设为16,以更充分地利用 CPU。然而,不应无限制地增加并行度,设置为48根本不会带来实质性的收益,反而会增加线程调度开销和框架调度开销。 + + 3.对于压力测试场景,压测的多个查询的任务本身就能够充分利用CPU,可以考虑**并行度设置为1**。 + +说明:对于压力测试场景,压测的查询的任务足够多。过大的并行度同样带来了线程调度开销和框架调度开销,这里需要设置为1是比较合理的。 + + 4.复杂查询的情况要根据Profile和机器负载,灵活调整,这里建议使用默认值,如果不合适可以尝试4-2-1的阶梯方式调整,观察查询表现和机器负载。 + + + +## **并行度调优的方法** + +Doris可以手动指定查询的并行度,以调整查询执行时并行执行的效率。 + +### **SQL级别调整:** + +通过SQL HINT 来指定单个SQL的并行度,这样可以灵活控制不同SQL的并行度来取得最佳的执行效果 + +```SQL +select /*SET_VAR("parallel_pipeline_task_num=8")*/ * from nation, lineitem where lineitem.l_suppkey = nation.n_nationkey +``` + +#### **会话级别调整:** + +通过session variables来调整会话级别的并行度,session 中的所有查询语句都将以指定的并行度执行。请注意,即使是单行查询的 SQL,也会使用该并行度,可能导致性能下降。 + +```SQL +set parallel_pipeline_task_num = 8; +``` + +#### **全局调整:** + +如果需要全局调整,通常涉及cpu利用率的调整,可以global设置并行度 + +```SQL +set global parallel_pipeline_task_num = 8; +``` + + + +## 案例 1:并行度过高导致高并发压力场景,CPU 使用率过高 + +当线上观察到 CPU 使用率过高,影响到部分低时延查询的性能时,可以考虑通过调整查询并行度来降低 CPU 使用率。由于 Doris 的设计理念是优先使用更多资源以最快速度获取查询结果,在某些线上资源紧张的场景下,可能会导致性能表现不佳。因此,适当调整并行度可以在资源有限的情况下提升查询的整体稳定性和效率。 + +设置并行度从默认的 0(CPU 核数的一半)到 4: + +```SQL +set global parallel_pipeline_task_num = 4; +``` + +global设置后,对于当前链接和新建链接全局生效,已有的其他链接不生效。如果需要即时全部生效,可以重启fe。调整之后,CPU 使用率降低到原先高峰值的 60%,降低了部分时延较低的查询的影响。 + +## 案例 2:调高并行度,进一步利用 CPU 加速查询 + +当前 Doris 默认的并行度为 CPU 核数的一半,部分计算密集型的场景并不能充分利用满 CPU 进行查询加速, + +```SQL +select sum(if(t2.value is null, 0, 1)) exist_value, sum(if(t2.value is null, 1, 0)) no_exist_value +from t1 left join t2 on t1.key = t2.key; +``` + +在左表 20 亿,右表 500 万的场景上,上述 SQL 需要执行 28s。观察 Profile: + +```SQL + HASH_JOIN_OPERATOR (id=3 , nereids_id=448): + - PlanInfo + - join op: LEFT OUTER JOIN(BROADCAST)[] + - equal join conjunct: (value = value) + - cardinality=2,462,330,332 + - vec output tuple id: 5 + - output tuple id: 5 + - vIntermediate tuple ids: 4 + - hash output slot ids: 16 + - projections: value + - project output tuple id: 5 + - BlocksProduced: sum 360.099K (360099), avg 45.012K (45012), max 45.014K (45014), min 45.011K (45011) + - CloseTime: avg 8.44us, max 13.327us, min 5.574us + - ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms + - InitTime: avg 7.122us, max 13.395us, min 4.541us + - MemoryUsage: sum , avg , max , min + - PeakMemoryUsage: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB + - ProbeKeyArena: sum 1.16 MB, avg 148.00 KB, max 148.00 KB, min 148.00 KB + - OpenTime: avg 2.967us, max 4.120us, min 1.562us + - ProbeRows: sum 1.4662330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) + - ProjectionTime: avg 165.392ms, max 169.762ms, min 161.727ms + - RowsProduced: sum 1.462330332B (1462330332), avg 182.791291M (182791291), max 182.811875M (182811875), min 182.782658M (182782658) +``` + +这里主要的时间耗时:`ExecTime: avg 26sec153ms, max 26sec261ms, min 26sec33ms`都发生在 Join 算子上,同时处理的数据总量:`ProbeRows: sum 1.4662330332B`有14亿,这是一个典型的 CPU 密集的运算情况。观察机器监控,发现 CPU 资源没有打满,CPU 利用率为 60%,此时可以考虑调高并行度来进一步利用空闲的 CPU 资源进行加速。 + +设置并行度如下: + +```SQL +set parallel_pipeline_task_num = 16; +``` + +查询耗时从 28s 降低到 19s,cpu 利用率从 60%上升到 90%。 + +## 总结 + +通常用户不需要介入调整查询并行度,如需要调整,需要注意以下事项: + +1. 建议从 CPU 利用率出发。通过 PROFILE 工具输出观察是否是 CPU 瓶颈,尝试进行并行度的合理修改 +2. 单 SQL 调整比较安全,尽量不要全局做过于激进的修改 \ No newline at end of file diff --git a/sidebars.json b/sidebars.json index 5f82890d9e0..e1be6af90b0 100644 --- a/sidebars.json +++ b/sidebars.json @@ -372,11 +372,10 @@ "items": [ "query-acceleration/tuning/tuning-execution/adjustment-of-runtimefilter-wait-time", "query-acceleration/tuning/tuning-execution/data-skew-handling", - "query-acceleration/tuning/tuning-execution/parallelism-adjustment" + "query-acceleration/tuning/tuning-execution/parallelism-tuning" ] }, - "query-acceleration/tuning/tuning-parameters", - "query-acceleration/tuning/parallelism-tuning" + "query-acceleration/tuning/tuning-parameters" ] }, { --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org