This is an automated email from the ASF dual-hosted git repository. yiguolei pushed a commit to branch branch-2.1 in repository https://gitbox.apache.org/repos/asf/doris.git
commit 4085657a28ac1c165a756a284f4ab0b3db3f9d87 Author: meiyi <myime...@gmail.com> AuthorDate: Fri Feb 23 19:49:15 2024 +0800 [doc](group commit) Add group commit performance (#31343) --- .../import/import-way/group-commit-manual.md | 92 ++++++++++++++++++++- .../import/import-way/group-commit-manual.md | 94 +++++++++++++++++++++- 2 files changed, 179 insertions(+), 7 deletions(-) diff --git a/docs/en/docs/data-operate/import/import-way/group-commit-manual.md b/docs/en/docs/data-operate/import/import-way/group-commit-manual.md index e5f935b8da4..b2c677a8f63 100644 --- a/docs/en/docs/data-operate/import/import-way/group-commit-manual.md +++ b/docs/en/docs/data-operate/import/import-way/group-commit-manual.md @@ -335,13 +335,26 @@ curl --location-trusted -u {user}:{passwd} -T data.csv -H "group_commit:sync_mo See [Stream Load](stream-load-manual.md) for more detailed syntax used by **Http Stream**. -## Modify the group commit interval +## Group commit condition + +The data will be automatically committed either when the time interval (default is 10 seconds) or the data size (default is 64 MB) conditions meet. + +### Modify the time interval condition The default group commit interval is 10 seconds. Users can modify the configuration of the table: ```sql # Modify the group commit interval to 2 seconds -ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); +ALTER TABLE dt SET ("group_commit_interval_ms" = "2000"); +``` + +### Modify the data size condition + +The default group commit data size is 64 MB. Users can modify the configuration of the table: + +```sql +# Modify the group commit data size to 128MB +ALTER TABLE dt SET ("group_commit_data_bytes" = "134217728"); ``` ## Limitations @@ -362,7 +375,7 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); * Two phase commit - * Specify the label + * Specify the label by set header `-H "label:my_label"` * Column update @@ -413,3 +426,76 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); * Description: The `max_filter_ratio` limit can only work if the total rows of `group commit` is less than this value. * Default: 10000 +## Performance + +We have separately tested the write performance of group commit in high-concurrency scenarios with small data volumes using `Stream Load` and `JDBC` (in `async mode`). + +### Stream Load + +#### Environment + +* 1 FE: 8-core CPU, 16 GB RAM, 1 200 GB SSD disk +* 3 BE: 16-core CPU, 64 GB RAM, 1 2 TB SSD disk +* 1 Client: 8-core CPU, 64 GB RAM, 1 100 GB SSD disk + +#### DataSet + +* `httplogs`, 31 GB, 247249096 (247 million) rows + +#### Test Tool + +* [doris-streamloader](https://github.com/apache/doris-streamloader) + +#### Test Method + +* Setting different single-concurrency data size and concurrency num between `non group_commit` and `group_commit` modes. + +#### Test Result + +| Load Way | Single-concurrency Data Size | Concurrency | Cost Seconds | Rows / Seconds | MB / Seconds | +|--------------------|------------------------------|-------------|--------------------|----------------|--------------| +| `group_commit` | 10 KB | 10 | 3707 | 66,697 | 8.56 | +| `group_commit` | 10 KB | 30 | 3385 | 73,042 | 9.38 | +| `group_commit` | 100 KB | 10 | 473 | 522,725 | 67.11 | +| `group_commit` | 100 KB | 30 | 390 | 633,972 | 81.39 | +| `group_commit` | 500 KB | 10 | 323 | 765,477 | 98.28 | +| `group_commit` | 500 KB | 30 | 309 | 800,158 | 102.56 | +| `group_commit` | 1 MB | 10 | 304 | 813,319 | 104.24 | +| `group_commit` | 1 MB | 30 | 286 | 864,507 | 110.88 | +| `group_commit` | 10 MB | 10 | 290 | 852,583 | 109.28 | +| `non group_commit` | 1 MB | 10 | `-235 error` | | | +| `non group_commit` | 10 MB | 10 | 519 | 476,395 | 61.12 | +| `non group_commit` | 10 MB | 30 | `-235 error` | | | + +In the above test, the CPU usage of BE fluctuates between 10-40%. + +The `group_commit` effectively enhances import performance while reducing the number of versions, thereby alleviating the pressure on compaction. + +### JDBC + +#### Environment + +* 1 FE: 8-core CPU, 16 GB RAM, 1 200 GB SSD disk +* 1 BE: 16-core CPU, 64 GB RAM, 1 2 TB SSD disk +* 1 Client: 16-core CPU, 64 GB RAM, 1 100 GB SSD disk + +#### DataSet + +* The data of tpch sf10 `lineitem` table, 20 files, 14 GB, 120 million rows + +#### Test Method + +* [DataX](https://github.com/alibaba/DataX) + +#### Test Method + +* Use `txtfilereader` wtite data to `mysqlwriter`, config different concurrenncy and rows for per `INSERT` sql. + +#### Test Result + + +| Rows per insert | Concurrency | Rows / Second | MB / Second | +|-----------------|-------------|---------------|-------------| +| 100 | 20 | 106931 | 11.46 | + +In the above test, the CPU usage of BE fluctuates between 10-20%, FE fluctuates between 60-70%. diff --git a/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md b/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md index 76054ca49cc..e7dfd96d83a 100644 --- a/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md +++ b/docs/zh-CN/docs/data-operate/import/import-way/group-commit-manual.md @@ -334,13 +334,26 @@ curl --location-trusted -u {user}:{passwd} -T data.csv -H "group_commit:sync_mo 关于 Http Stream 使用的更多详细语法及最佳实践,请参阅 [Stream Load](stream-load-manual.md)。 -## 修改group commit默认提交间隔 +## 自动提交条件 -group commit 的默认提交间隔为 10 秒,用户可以通过修改表的配置,调整 group commit 的提交间隔: +当满足时间间隔(默认为 10 秒)或数据量(默认为 64 MB)其中一个条件时,会自动提交数据。 + +### 修改提交间隔 + +默认提交间隔为 10 秒,用户可以通过修改表的配置调整: ```sql # 修改提交间隔为 2 秒 -ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); +ALTER TABLE dt SET ("group_commit_interval_ms" = "2000"); +``` + +### 修改提交数据量 + +group commit 的默认提交数据量为 64 MB,用户可以通过修改表的配置调整: + +```sql +# 修改提交数据量为 128MB +ALTER TABLE dt SET ("group_commit_data_bytes" = "134217728"); ``` ## 使用限制 @@ -361,7 +374,7 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); + 两阶段提交 - + 指定 label + + 指定 label,即通过 `-H "label:my_label"`设置 + 列更新写入 @@ -411,3 +424,76 @@ ALTER TABLE dt SET ("group_commit_interval_ms"="2000"); * 描述: 当 group commit 导入的总行数不高于该值,`max_filter_ratio` 正常工作,否则不工作 * 默认值: 10000 + +## 性能 + +我们分别测试了使用`Stream Load`和`JDBC`在高并发小数据量场景下`group commit`(使用`async mode`)的写入性能。 + +### Stream Load日志场景测试 + +#### 机器配置 + +* 1台 FE:8核 CPU、16GB 内存、1块 200GB 通用性 SSD 云磁盘 +* 3台 BE:16核 CPU、64GB 内存、1块 2TB 通用性 SSD 云磁盘 +* 1台测试客户端:16核 CPU、64GB 内存、1块 100GB 通用型 SSD 云磁盘 + +#### 数据集 + +* `httplogs`数据集,总共 31GB、2.47亿条 + +#### 测试工具 + +* [doris-streamloader](https://github.com/apache/doris-streamloader) + +#### 测试方法 + +* 对比`非group_commit`和`group_commit`的`async_mode`模式下,设置不同的单并发数据量和并发数,导入`247249096`行数据 + +#### 测试结果 + +| 导入方式 | 单并发数据量 | 并发数 | 耗时(秒) | 导入速率(行/秒) | 导入吞吐(MB/秒) | +|----------------|---------|------|-----------|----------|-----------| +| `group_commit` | 10 KB | 10 | 3707 | 66,697 | 8.56 | +| `group_commit` | 10 KB | 30 | 3385 | 73,042 | 9.38 | +| `group_commit` | 100 KB | 10 | 473 | 522,725 | 67.11 | +| `group_commit` | 100 KB | 30 | 390 | 633,972 | 81.39 | +| `group_commit` | 500 KB | 10 | 323 | 765,477 | 98.28 | +| `group_commit` | 500 KB | 30 | 309 | 800,158 | 102.56 | +| `group_commit` | 1 MB | 10 | 304 | 813,319 | 104.24 | +| `group_commit` | 1 MB | 30 | 286 | 864,507 | 110.88 | +| `group_commit` | 10 MB | 10 | 290 | 852,583 | 109.28 | +| `非group_commit` | 1 MB | 10 | 导入报错-235 | | | +| `非group_commit` | 10 MB | 10 | 519 | 476,395 | 61.12 | +| `非group_commit` | 10 MB | 30 | 导入报错-235 | | | + +在上面的`group_commit`测试中,BE的CPU使用率在10-40%之间。 + +可以看出,`group_commit`模式在小数据量并发导入的场景下,能有效的提升导入性能,同时减少版本数,降低系统合并数据的压力。 + +### JDBC + +#### 机器配置 + +* 1台 FE:8核 CPU、16 GB 内存、1块 200 GB 通用性 SSD 云磁盘 +* 1台 BE:16核 CPU、64 GB 内存、1块 2 TB 通用性 SSD 云磁盘 +* 1台测试客户端:16核 CPU、64GB内存、1块 100 GB 通用型 SSD 云磁盘 + +#### 数据集 + +* tpch sf10 `lineitem`表数据集,30个文件,总共约 22 GB,1.8亿行 + +#### 测试工具 + +* [DataX](https://github.com/alibaba/DataX) + +#### 测试方法 + +* 通过`txtfilereader`向`mysqlwriter`写入数据,配置不同并发数和单个`INSERT`的行数 + +#### 测试结果 + +| 单个insert的行数 | 并发数 | 导入速率(行/秒) | 导入吞吐(MB/秒) | +|-------------|-----|-----------|----------| +| 100 | 20 | 106931 | 11.46 | + +在上面的测试中,FE 的 CPU使用率在60-70%左右,BE 的 CPU使用率在10-20%左右。 \ No newline at end of file --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org