(doris) 01/04: [doc](group commit) Add group commit performance (#31343)

yiguolei Fri, 23 Feb 2024 04:45:01 -0800

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch branch-2.1
in repository https://gitbox.apache.org/repos/asf/doris.git


commit 4085657a28ac1c165a756a284f4ab0b3db3f9d87
Author: meiyi <myime...@gmail.com>
AuthorDate: Fri Feb 23 19:49:15 2024 +0800

    [doc](group commit) Add group commit performance (#31343)
---
 .../import/import-way/group-commit-manual.md       | 92 ++++++++++++++++++++-
 .../import/import-way/group-commit-manual.md       | 94 +++++++++++++++++++++-
 2 files changed, 179 insertions(+), 7 deletions(-)
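
Note on the byte value used in the data size example: the patch below sets `group_commit_data_bytes` to `134217728` for 128 MB, which corresponds to binary megabytes (1 MB = 1024 * 1024 bytes). A minimal Python sketch of that conversion, assuming the same convention applies to other sizes; the helper names are hypothetical and not part of Doris:

```python
def group_commit_data_bytes(size_mb: int) -> str:
    """Return the property value (a string of bytes) for a size in MB.

    Assumes 1 MB = 1024 * 1024 bytes, which matches the 128 MB example
    in the patch (134217728).
    """
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return str(size_mb * 1024 * 1024)


def alter_table_statement(table: str, size_mb: int) -> str:
    """Build the ALTER TABLE statement in the form shown in the patch."""
    value = group_commit_data_bytes(size_mb)
    return f'ALTER TABLE {table} SET ("group_commit_data_bytes" = "{value}");'


if __name__ == "__main__":
    print(alter_table_statement("dt", 128))
```

With `dt` and 128 as inputs, the generated statement is the one that appears in the documentation change.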

diff --git a/docs/en/docs/data-operate/import/import-way/group-commit-manual.md b/docs/en/docs/data-operate/import/import-way/group-commit-manual.md
index e5f935b8da4..b2c677a8f63 100644
--- a/docs/en/docs/data-operate/import/import-way/group-commit-manual.md
+++ b/docs/en/docs/data-operate/import/import-way/group-commit-manual.md
@@ -335,13 +335,26 @@ curl --location-trusted -u {user}:{passwd} -T data.csv  -H "group_commit:sync_mo
 
 See [Stream Load](stream-load-manual.md) for more detailed syntax used by **Http Stream**.
 
-## Modify the group commit interval
+## Group commit condition
+
+Data is committed automatically when either the time interval condition (10 seconds by default) or the data size condition (64 MB by default) is met.
+
+### Modify the time interval condition
 
 The default group commit interval is 10 seconds. Users can modify the configuration of the table:
 
 ```sql
 # Modify the group commit interval to 2 seconds
-ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
+ALTER TABLE dt SET ("group_commit_interval_ms" = "2000");
+```
+
+### Modify the data size condition
+
+The default group commit data size is 64 MB. Users can modify the configuration of the table:
+
+```sql
+# Modify the group commit data size to 128MB
+ALTER TABLE dt SET ("group_commit_data_bytes" = "134217728");
 ```
 
 ## Limitations
@@ -362,7 +375,7 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
 
   * Two phase commit
 
-  * Specify the label
+  * Specify the label by setting the header `-H "label:my_label"`
 
   * Column update
 
@@ -413,3 +426,76 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
 * Description: The `max_filter_ratio` limit can only work if the total rows of `group commit` is less than this value.
 * Default: 10000
 
+## Performance
+
+We have separately tested the write performance of group commit in high-concurrency scenarios with small data volumes using `Stream Load` and `JDBC` (in `async mode`).
+
+### Stream Load
+
+#### Environment
+
+* 1 FE: 8-core CPU, 16 GB RAM, one 200 GB SSD disk
+* 3 BE: 16-core CPU, 64 GB RAM, one 2 TB SSD disk
+* 1 Client: 8-core CPU, 64 GB RAM, one 100 GB SSD disk
+
+#### DataSet
+
+* `httplogs`, 31 GB, 247249096 (247 million) rows
+
+#### Test Tool
+
+* [doris-streamloader](https://github.com/apache/doris-streamloader)
+
+#### Test Method
+
+* Load the dataset in `non group_commit` and `group_commit` modes with different single-concurrency data sizes and concurrency levels, and compare the results.
+
+#### Test Result
+
+| Load Way           | Single-concurrency Data Size | Concurrency | Cost (Seconds) | Rows / Second | MB / Second |
+|--------------------|------------------------------|-------------|----------------|---------------|-------------|
+| `group_commit`     | 10 KB                        | 10          | 3707           | 66,697        | 8.56        |
+| `group_commit`     | 10 KB                        | 30          | 3385           | 73,042        | 9.38        |
+| `group_commit`     | 100 KB                       | 10          | 473            | 522,725       | 67.11       |
+| `group_commit`     | 100 KB                       | 30          | 390            | 633,972       | 81.39       |
+| `group_commit`     | 500 KB                       | 10          | 323            | 765,477       | 98.28       |
+| `group_commit`     | 500 KB                       | 30          | 309            | 800,158       | 102.56      |
+| `group_commit`     | 1 MB                         | 10          | 304            | 813,319       | 104.24      |
+| `group_commit`     | 1 MB                         | 30          | 286            | 864,507       | 110.88      |
+| `group_commit`     | 10 MB                        | 10          | 290            | 852,583       | 109.28      |
+| `non group_commit` | 1 MB                         | 10          | `-235 error`   |               |             |
+| `non group_commit` | 10 MB                        | 10          | 519            | 476,395       | 61.12       |
+| `non group_commit` | 10 MB                        | 30          | `-235 error`   |               |             |
+
+In the above test, the CPU usage of BE fluctuated between 10% and 40%.
+
+`group_commit` effectively improves import performance while reducing the number of versions, which relieves the pressure on compaction.
+
+### JDBC
+
+#### Environment
+
+* 1 FE: 8-core CPU, 16 GB RAM, one 200 GB SSD disk
+* 1 BE: 16-core CPU, 64 GB RAM, one 2 TB SSD disk
+* 1 Client: 16-core CPU, 64 GB RAM, one 100 GB SSD disk
+
+#### DataSet
+
+* The data of tpch sf10 `lineitem` table, 20 files, 14 GB, 120 million rows
+
+#### Test Tool
+
+* [DataX](https://github.com/alibaba/DataX)
+
+#### Test Method
+
+* Use `txtfilereader` to write data to `mysqlwriter`, configuring different concurrency levels and numbers of rows per `INSERT` statement.
+
+#### Test Result
+
+
+| Rows per insert | Concurrency | Rows / Second | MB / Second |
+|-----------------|-------------|---------------|-------------|
+| 100             | 20          | 106931        | 11.46       |
+
+In the above test, the CPU usage of BE fluctuated between 10% and 20%, and that of FE between 60% and 70%.
diff --git a/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md b/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md
index 76054ca49cc..e7dfd96d83a 100644
--- a/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md
+++ b/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md
@@ -334,13 +334,26 @@ curl --location-trusted -u {user}:{passwd} -T data.csv  -H "group_commit:sync_mo
 
 关于 Http Stream 使用的更多详细语法及最佳实践，请参阅 [Stream Load](stream-load-manual.md)。
 
-## 修改group commit默认提交间隔
+## 自动提交条件
 
-group commit 的默认提交间隔为 10 秒，用户可以通过修改表的配置，调整 group commit 的提交间隔：
+当满足时间间隔(默认为 10 秒)或数据量(默认为 64 MB)其中一个条件时，会自动提交数据。
+
+### 修改提交间隔
+
+默认提交间隔为 10 秒，用户可以通过修改表的配置调整：
 
 ```sql
 # 修改提交间隔为 2 秒
-ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
+ALTER TABLE dt SET ("group_commit_interval_ms" = "2000");
+```
+
+### 修改提交数据量
+
+group commit 的默认提交数据量为 64 MB，用户可以通过修改表的配置调整：
+
+```sql
+# 修改提交数据量为 128MB
+ALTER TABLE dt SET ("group_commit_data_bytes" = "134217728");
 ```
 
 ## 使用限制
@@ -361,7 +374,7 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
 
   + 两阶段提交
 
-  + 指定 label
+  + 指定 label，即通过 `-H "label:my_label"`设置
 
   + 列更新写入
 
@@ -411,3 +424,76 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000");
 
 * 描述:  当 group commit 导入的总行数不高于该值，`max_filter_ratio` 正常工作，否则不工作
 * 默认值: 10000
+
+## 性能
+
+我们分别测试了使用`Stream Load`和`JDBC`在高并发小数据量场景下`group commit`(使用`async mode`)的写入性能。
+
+### Stream Load日志场景测试
+
+#### 机器配置
+
+* 1台 FE：8核 CPU、16GB 内存、1块 200GB 通用性 SSD 云磁盘
+* 3台 BE：16核 CPU、64GB 内存、1块 2TB 通用性 SSD 云磁盘
+* 1台测试客户端：16核 CPU、64GB 内存、1块 100GB 通用型 SSD 云磁盘
+
+#### 数据集
+
+* `httplogs`数据集，总共 31GB、2.47亿条
+
+#### 测试工具
+
+* [doris-streamloader](https://github.com/apache/doris-streamloader)
+
+#### 测试方法
+
+* 对比`非group_commit`和`group_commit`的`async_mode`模式下，设置不同的单并发数据量和并发数，导入`247249096`行数据
+
+#### 测试结果
+
+| 导入方式    | 单并发数据量  | 并发数  | 耗时(秒)     | 导入速率(行/秒) | 导入吞吐(MB/秒) |
+|----------------|---------|------|-----------|----------|-----------|
+| `group_commit` | 10 KB   | 10   | 3707      | 66,697   | 8.56 |
+| `group_commit` | 10 KB   | 30   | 3385      | 73,042   | 9.38 |
+| `group_commit` | 100 KB  | 10   | 473       | 522,725  | 67.11 |
+| `group_commit` | 100 KB  | 30   | 390       | 633,972  | 81.39 |
+| `group_commit` | 500 KB  | 10   | 323       | 765,477  | 98.28 |
+| `group_commit` | 500 KB  | 30   | 309       | 800,158  | 102.56 |
+| `group_commit` | 1 MB    | 10   | 304       | 813,319  | 104.24 |
+| `group_commit` | 1 MB    | 30   | 286       | 864,507  | 110.88 |
+| `group_commit` | 10 MB   | 10   | 290       | 852,583  | 109.28 |
+| `非group_commit` | 1 MB    | 10   | 导入报错-235  |  | |
+| `非group_commit` | 10 MB   | 10   | 519       | 476,395  | 61.12 |
+| `非group_commit` | 10 MB   | 30   | 导入报错-235  |  | |
+
+在上面的`group_commit`测试中，BE的CPU使用率在10-40%之间。
+
+可以看出，`group_commit`模式在小数据量并发导入的场景下，能有效的提升导入性能，同时减少版本数，降低系统合并数据的压力。
+
+### JDBC
+
+#### 机器配置
+
+* 1台 FE：8核 CPU、16 GB 内存、1块 200 GB 通用性 SSD 云磁盘
+* 1台 BE：16核 CPU、64 GB 内存、1块 2 TB 通用性 SSD 云磁盘
+* 1台测试客户端：16核 CPU、64GB内存、1块 100 GB 通用型 SSD 云磁盘
+
+#### 数据集
+
+* tpch sf10 `lineitem`表数据集，30个文件，总共约 22 GB，1.8亿行
+
+#### 测试工具
+
+* [DataX](https://github.com/alibaba/DataX)
+
+#### 测试方法
+
+* 通过`txtfilereader`向`mysqlwriter`写入数据，配置不同并发数和单个`INSERT`的行数
+
+#### 测试结果
+
+| 单个insert的行数 | 并发数 | 导入速率(行/秒) | 导入吞吐(MB/秒) |
+|-------------|-----|-----------|----------|
+| 100 | 20  | 106931    | 11.46 |
+
+在上面的测试中，FE 的 CPU使用率在60-70%左右，BE 的 CPU使用率在10-20%左右。
\ No newline at end of file
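
The Rows / Second column in the Stream Load results is consistent with dividing the dataset's 247249096 rows by the cost in seconds and truncating to an integer. A small Python sketch of that arithmetic, using only figures taken from the table; the function and variable names are illustrative:

```python
TOTAL_ROWS = 247_249_096  # rows in the httplogs dataset

# Cost in seconds -> Rows / Second, as reported in the Stream Load table.
REPORTED = {
    3707: 66_697,
    3385: 73_042,
    473: 522_725,
    390: 633_972,
    323: 765_477,
    309: 800_158,
    304: 813_319,
    286: 864_507,
    290: 852_583,
    519: 476_395,  # non group_commit, 10 MB, concurrency 10
}


def rows_per_second(total_rows: int, cost_seconds: int) -> int:
    """Rows loaded per second, truncated to an integer."""
    return total_rows // cost_seconds


def speedup(baseline_seconds: int, candidate_seconds: int) -> float:
    """How many times faster the candidate run finished than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    for seconds, reported in REPORTED.items():
        computed = rows_per_second(TOTAL_ROWS, seconds)
        print(seconds, computed, computed == reported)
    # 10 MB, concurrency 10: non group_commit (519 s) vs group_commit (290 s)
    print(round(speedup(519, 290), 2))
```

For the one configuration where both modes completed (10 MB, concurrency 10), the same figures give the ratio between the 519-second and 290-second runs.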


