This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch branch-2.1
in repository https://gitbox.apache.org/repos/asf/doris.git
commit 88edb26e895179239adf2319b5db398bf6b727ff
Author: KassieZ <139741991+kass...@users.noreply.github.com>
AuthorDate: Wed Jan 31 15:02:40 2024 +0800

    [docs](update) Update Doris-Streamloader docs (#30552)
---
 .../import/import-way/stream-load-manual.md     |  13 ++
 docs/en/docs/ecosystem/doris-streamloader.md    | 250 +++++++++++++++++++++
 docs/sidebars.json                              |   1 +
 .../import/import-way/stream-load-manual.md     |  12 +
 docs/zh-CN/docs/ecosystem/doris-streamloader.md | 247 ++++++++++++++++++++
 5 files changed, 523 insertions(+)

diff --git a/docs/en/docs/data-operate/import/import-way/stream-load-manual.md b/docs/en/docs/data-operate/import/import-way/stream-load-manual.md
index bb2cdde4e3d..e7aefde9357 100644
--- a/docs/en/docs/data-operate/import/import-way/stream-load-manual.md
+++ b/docs/en/docs/data-operate/import/import-way/stream-load-manual.md
@@ -30,6 +30,19 @@ Stream load is a synchronous way of importing. Users import local files or data

Stream load is mainly suitable for importing local files or data from data streams through procedures.

:::tip

Doris-Streamloader is a client tool designed for loading data into Apache Doris. Compared to single-threaded loading with `curl`, it reduces the ingestion latency of large datasets through its concurrent loading capabilities. It comes with the following features:

- **Parallel loading**: multi-threaded loading for the Stream Load method. You can set the parallelism level using the `workers` parameter.
- **Multi-file load**: simultaneous loading of multiple files and directories in one shot. It supports recursive file fetching and allows you to specify file names with wildcard characters.
- **Path traversal support**: supports path traversal when the source files are in directories.
- **Resilience and continuity**: in case of partial load failures, it can resume data loading from the point of failure.
- **Automatic retry mechanism**: in case of loading failures, it can automatically retry a default number of times. If the loading remains unsuccessful, it prints the command for manual retry.

See [Doris-Streamloader](../docs/ecosystem/doris-streamloader) for detailed instructions and best practices.
:::

## Basic Principles

The following figure shows the main flow of Stream load, omitting some import details.

diff --git a/docs/en/docs/ecosystem/doris-streamloader.md b/docs/en/docs/ecosystem/doris-streamloader.md
new file mode 100644
index 00000000000..4dc46085241
--- /dev/null
+++ b/docs/en/docs/ecosystem/doris-streamloader.md
@@ -0,0 +1,250 @@
---
{
    "title": "Doris-Streamloader",
    "language": "en"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->


## Overview

Doris-Streamloader is a client tool designed for loading data into Apache Doris. Compared to single-threaded loading with `curl` (a reference `curl` command is shown after the feature list below), it reduces the load latency of large datasets through its concurrent loading capabilities. It comes with the following features:

- **Parallel loading**: multi-threaded loading for the Stream Load method. You can set the parallelism level using the `workers` parameter.
- **Multi-file load**: simultaneous loading of multiple files and directories in one shot. It supports recursive file fetching and allows you to specify file names with wildcard characters.
- **Path traversal support**: supports path traversal when the source files are in directories.
- **Resilience and continuity**: in case of partial load failures, it can resume data loading from the point of failure.
- **Automatic retry mechanism**: in case of loading failures, it can automatically retry a default number of times. If the loading remains unsuccessful, it prints the command for manual retry.
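For reference, the single-threaded alternative looks roughly like the sketch below: one file pushed over one connection per invocation. This is an illustration only; the host, FE HTTP port 8030, credentials, and the `testdb.testtbl` target are placeholder assumptions.

```bash
# Single-threaded Stream Load of one file via plain curl (for comparison);
# all names, ports, and credentials below are placeholders.
curl --location-trusted -u root: \
     -H "column_separator:|" \
     -H "columns:col1,col2" \
     -T file.csv \
     http://localhost:8030/api/testdb/testtbl/_stream_load
```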
## Installation

**Version 1.0**

Source code: https://github.com/apache/doris-streamloader/

| Version | Date | Architecture | Link |
|---|---|---|---|
| v1.0 | 20240131 | x64 | https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-streamloader-1.0.1-bin-x64.tar.xz |
| v1.0 | 20240131 | arm64 | https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-streamloader-1.0.1-bin-arm64.tar.xz |

:::note
The downloaded artifact is the executable binary itself.
:::


## How to use

```bash
doris-streamloader --source_file={FILE_LIST} --url={FE_OR_BE_SERVER_URL}:{PORT} --header={STREAMLOAD_HEADER} --db={TARGET_DATABASE} --table={TARGET_TABLE}
```

**1. `FILE_LIST` supports:**

- A single file

  E.g. load a single file file.csv:

  ```bash
  doris-streamloader --source_file="file.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
  ```

- A single directory

  E.g. load a single directory dir:

  ```bash
  doris-streamloader --source_file="dir" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
  ```

- File names with wildcard characters (enclosed in quotes)

  E.g. load file0.csv, file1.csv, file2.csv:

  ```bash
  doris-streamloader --source_file="file*" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
  ```

- A comma-separated list of files

  E.g. load file0.csv, file1.csv, file2.csv:

  ```bash
  doris-streamloader --source_file="file0.csv,file1.csv,file2.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
  ```

- A comma-separated list of directories

  E.g. load dir1, dir2, dir3:

  ```bash
  doris-streamloader --source_file="dir1,dir2,dir3" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
  ```


**2. `STREAMLOAD_HEADER` supports all Stream Load headers, separated by '?' when there is more than one.**

Example:

```bash
doris-streamloader --source_file="data.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
```

The parameters above are required; the following parameters are optional:

| Parameter | Description | Default Value | Suggestions |
|---|---|---|---|
| --u | Username of the database | root | |
| --p | Password | empty string | |
| --compress | Whether to compress data for HTTP transmission | false | Remain as default. Compression and decompression increase pressure on the Doris-Streamloader side and CPU usage on the Doris BE side, so enable this only when network bandwidth is constrained. |
| --timeout | Timeout of the HTTP request sent to Doris (seconds) | 60\*60\*10 | Remain as default |
| --batch | Granularity of batch reading and sending of files (rows) | 4096 | Remain as default |
| --batch_byte | Granularity of batch reading and sending of files (bytes) | 943718400 (900MB) | Remain as default |
| --workers | Concurrency level of data loading | 0 | "0" means auto mode, in which the Stream Load speed is inferred from the data size and disk throughput. You can dial this up for a high-performance cluster, but it is advised to keep it below 10. If you observe excessive memory usage (via the memtracker in the log), dial this down. |
| --disk_throughput | Disk throughput (MB/s) | 800 | Usually remain as default. This parameter is a basis for the automatic inference of `workers`. You can adjust it based on your measurements to get a more appropriate `workers` value. |
| --streamload_throughput | Stream Load throughput (MB/s) | 100 | Usually remain as default. The default value is derived from the Stream Load throughput and predicted performance of the daily performance testing environment. To get a more appropriate `workers` value, you can set this to your measured Stream Load throughput: (LoadBytes\*1000)/(LoadTimeMs\*1024\*1024); see the worked example after this table. |
| --max_byte_per_task | Maximum data size for each load task. For a dataset exceeding this size, the remaining part is split into a new load task. | 107374182400 (100G) | This is recommended to be large in order to reduce the number of load versions. However, if you encounter a "body exceed max size" error and want to avoid adjusting the `streaming_load_max_mb` parameter (which requires restarting the backend), or if you encounter a "-238 TOO MANY SEGMENT" error, you can temporarily dial this down. |
| --check_utf8 | <p>Whether to check the encoding of the loaded data:</p> <p>1) false: load the raw data directly without checking; 2) true: replace non UTF-8 characters with �</p> | true | Remain as default |
| --debug | Print debug log | false | Remain as default |
| --auto_retry | The list of failed workers and tasks for auto retry | empty string | This is only used when there is a load failure. The serial numbers of the failed workers and tasks are shown, and all you need to do is copy and execute the entire command. For example, if --auto_retry="1,1;2,1", the failed tasks are the first task of the first worker and the first task of the second worker. |
| --auto_retry_times | Times of auto retries | 3 | Remain as default. If you don't need retries, set this to 0. |
| --auto_retry_interval | Interval of auto retries | 60 | Remain as default. If the load failure is caused by a Doris downtime, it is recommended to set this parameter based on the restart interval of Doris. |
| --log_filename | Path for log storage | "" | Logs are printed to the console by default. To print them to a log file, set the path, e.g. --log_filename="/var/log". |
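To choose a measured value for `--streamload_throughput`, take the `LoadBytes` and `LoadTimeMs` fields from a previous load result (see "Result description" below) and plug them into the formula above. A minimal sketch with made-up figures:

```bash
# (LoadBytes*1000)/(LoadTimeMs*1024*1024) yields MB/s; the byte and
# millisecond values here are hypothetical, not measurements.
awk 'BEGIN { bytes = 3221225472; ms = 25000;
             printf "%.1f MB/s\n", (bytes * 1000) / (ms * 1024 * 1024) }'
# prints: 122.9 MB/s
```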
## Result description

A result is returned whether the data loading succeeds or fails:

| Parameter | Description |
|---|---|
| Status | Whether the loading succeeded ("Success") or failed ("Failed") |
| TotalRows | Total number of rows |
| FailLoadRows | Number of rows that failed to load |
| LoadedRows | Number of rows loaded |
| FilteredRows | Number of rows filtered |
| UnselectedRows | Number of rows unselected |
| LoadBytes | Number of bytes loaded |
| LoadTimeMs | Actual loading time (ms) |
| LoadFiles | List of loaded files |


Examples:

- If the loading succeeds, you will see a result like:

  ```json
  Load Result: {
      "Status": "Success",
      "TotalRows": 120,
      "FailLoadRows": 0,
      "LoadedRows": 120,
      "FilteredRows": 0,
      "UnselectedRows": 0,
      "LoadBytes": 40632,
      "LoadTimeMs": 971,
      "LoadFiles": [
          "basic.csv",
          "basic_data1.csv",
          "basic_data2.csv",
          "dir1/basic_data.csv",
          "dir1/basic_data.csv.1",
          "dir1/basic_data1.csv"
      ]
  }
  ```

- If the loading fails (or partially fails), you will see a retry message:

  ```text
  load has some error and auto retry failed, you can retry by :
  ./doris-streamloader --source_file /mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1 --url="http://127.0.0.1:8239" --header="column_separator:|?columns: l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag,l_linestatus, l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment,temp" --db="db" --table="lineitem1" -u root -p "" --compress=false --timeout=36000 --workers=3 --batch=4096 --batch_byt [...]
  ```

You can copy and execute the command. The failure message is also provided:

```json
Load Result: {
    "Status": "Failed",
    "TotalRows": 1,
    "FailLoadRows": 1,
    "LoadedRows": 0,
    "FilteredRows": 0,
    "UnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 104,
    "LoadFiles": [
        "/mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1"
    ]
}
```
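In scripted pipelines, the JSON part of this output can be checked mechanically. A minimal sketch, assuming the result has been saved to a file `result.json` with the leading `Load Result:` label stripped, and that `jq` is installed:

```bash
# Exit non-zero unless the load fully succeeded; the field names match the
# result table above, while the file name and workflow are hypothetical.
jq -e '.Status == "Success" and .FailLoadRows == 0' result.json > /dev/null \
  && echo "loaded $(jq -r '.LoadedRows' result.json) rows" \
  || { echo "load failed or incomplete" >&2; exit 1; }
```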
## Best practice

### Parameter suggestions

1. Required parameters:
```--source_file=FILE_LIST --url=FE_OR_BE_SERVER_URL_WITH_PORT --header=STREAMLOAD_HEADER --db=TARGET_DATABASE --table=TARGET_TABLE```
   If you need to load multiple files, configure all of them at once in `source_file`.

2. The default value of `workers` is the number of CPU cores. When that number is large, for example 96 cores, the value of `workers` should be dialed down. **The recommended value for most cases is 8.**

3. `max_byte_per_task` is recommended to be large in order to reduce the number of load versions. However, if you encounter a "body exceed max size" error and want to avoid adjusting the `streaming_load_max_mb` parameter (which requires restarting the backend), or if you encounter a `-238 TOO MANY SEGMENT` error, you can temporarily dial this down. **For most cases, this can remain as default.**

**Two parameters that impact the number of versions:**

- `workers`: The more workers, the higher the concurrency, and thus the more versions. The recommended value for most cases is 8.
- `max_byte_per_task`: The larger `max_byte_per_task`, the more data in a single version, and thus the fewer versions. However, if this is excessively high, it can easily cause a `-238 TOO MANY SEGMENT` error. For most cases, this can remain as default.


### Recommended commands

In most cases, you only need to set the required parameters and `workers`:

```text
./doris-streamloader --source_file="demo.csv,demoFile*.csv,demoDir" --url="http://127.0.0.1:8030" --header="column_separator:," --db="demo" --table="test_load" --u="root" --workers=8
```


### FAQ

- Before resumable loading was available, fixing a partial load failure required deleting the table and starting the load over. Now, Doris-Streamloader retries automatically in this case; if the retry fails, it prints a retry command that you can copy and execute.
- The maximum data size for a single load is limited by the BE config `streaming_load_max_mb` (default: 100GB). If you don't want to restart the BE, you can dial down `max_byte_per_task` instead.

  To show the current `streaming_load_max_mb`:

  ```bash
  curl "http://127.0.0.1:8040/api/show_config"
  ```

- If you encounter a `-238 TOO MANY SEGMENT` error, dial down `max_byte_per_task`.
\ No newline at end of file

diff --git a/docs/sidebars.json b/docs/sidebars.json
index f892da7c345..a366e32d0d3 100644
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -259,6 +259,7 @@
             "items": [
                 "ecosystem/spark-doris-connector",
                 "ecosystem/flink-doris-connector",
+                "ecosystem/doris-streamloader",
                 "ecosystem/datax",
                 "ecosystem/seatunnel",
                 "ecosystem/kyuubi",

diff --git a/docs/zh-CN/docs/data-operate/import/import-way/stream-load-manual.md b/docs/zh-CN/docs/data-operate/import/import-way/stream-load-manual.md
index 0f4dc28c737..69e47407906 100644
--- a/docs/zh-CN/docs/data-operate/import/import-way/stream-load-manual.md
+++ b/docs/zh-CN/docs/data-operate/import/import-way/stream-load-manual.md
@@ -30,6 +30,18 @@ Stream load 是一个同步的导入方式,用户通过发送 HTTP 协议发

Stream load is mainly suitable for importing local files, or importing data from a data stream through a program.

:::tip
Compared to single-concurrency loading directly with `curl`, the dedicated load tool **Doris-Streamloader** is recommended. It is a client tool built for loading data into Doris that provides **concurrent loading**, reducing the time needed to import large data volumes. It has the following features:

- Parallel loading: multi-threaded loading for the Stream Load method. The concurrency can be set via the `workers` value.
- Multi-file load: load multiple files and directories in one shot, with wildcard support and automatic recursive fetching of all files under a directory.
- Resumable loading: if parts of a load fail, transmission can resume from the point of failure.
- Automatic retry: after a failure, the tool automatically retries a default number of times without manual intervention; if it still fails, it prints the command for manual retry.

See the [Doris-Streamloader documentation](../docs/ecosystem/doris-streamloader) for usage details and best practices.
:::

## Basic Principles

The following figure shows the main flow of Stream load, omitting some import details.

diff --git a/docs/zh-CN/docs/ecosystem/doris-streamloader.md b/docs/zh-CN/docs/ecosystem/doris-streamloader.md
new file mode 100644
index 00000000000..40e7b1b3fbc
--- /dev/null
+++ b/docs/zh-CN/docs/ecosystem/doris-streamloader.md
@@ -0,0 +1,247 @@
---
{
    "title": "Doris-Streamloader",
    "language": "zh-CN"
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
+--> + + +## 概述 +[Doris-Streamloader](https://github.com/apache/doris-streamloader) 是一款用于将数据导入 Doris 数据库的专用客户端工具。相比于直接使用 `curl` 的单并发导入,该工具可以提供多并发导入的功能,降低大数据量导入的耗时。拥有以下功能: + +- 并发导入,实现 Stream Load 的多并发导入。可以通过 workers 值设置并发数。 +- 多文件导入,一次导入可以同时导入多个文件及目录,支持设置通配符以及会自动递归获取文件夹下的所有文件。 +- 断点续传,在导入过程中可能出现部分失败的情况,支持在失败点处进行继续传输。 +- 自动重传,在导入出现失败的情况后,无需手动重传,工具会自动重传默认的次数,如果仍然不成功,打印出手动重传的命令。 + +## 获取与安装 + +**1.0 版本** + +源代码: https://github.com/apache/doris-streamloader + +| 版本 | 日期 | 平台 | 链接 | +|---|---|---|---| +| v1.0 | 20240131 | x64 | https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-streamloader-1.0.1-bin-x64.tar.xz| +| v1.0 | 20240131 | arm64 | https://apache-doris-releases.oss-accelerate.aliyuncs.com/apache-doris-streamloader-1.0.1-bin-arm64.tar.xz| + +:::note +获取结果即为可执行二进制。 +::: + +## 使用方法 + +```bash + +doris-streamloader --source_file={FILE_LIST} --url={FE_OR_BE_SERVER_URL}:{PORT} --header={STREAMLOAD_HEADER} --db={TARGET_DATABASE} --table={TARGET_TABLE} + +``` + + +**1. `FILE_LIST` 支持:** + +- 单个文件 + + 例如:导入单个文件 file.csv + + ```json + doris-streamloader --source_file="dir" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl" + ``` + +- 单个目录 + + 例如:导入单个目录 dir + + ```json + doris-streamloader --source_file="dir" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl" + ``` + +- 带通配符的文件名(需要用引号包围) + + 例如:导入 file0.csv, file1.csv, file2.csv + + ```json + doris-streamloader --source_file="file*" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl" + ``` + +- 逗号分隔的文件名列表 + + 例如:导入 file0.csv, file1.csv file2.csv + + ```json + doris-streamloader --source_file="file0.csv,file1.csv,file2.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl" + ``` + +- 逗号分隔的目录列表 + + 例如:导入 dir1, dir2,dir3 + + ```json + doris-streamloader --source_file="dir1,dir2,dir3" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl" + ``` + +:::tip +当需要多个文件导入时,使用 Doris-Streamloader 也只会产生一个版本号 +::: + + + +**2.** `STREAMLOAD_HEADER` **支持 Stream Load 的所有参数,多个参数之间用 '?' 
**2. `STREAMLOAD_HEADER` supports all Stream Load parameters; separate multiple parameters with '?'.**

Example:

```bash
doris-streamloader --source_file="data.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
```

The parameters above are all required; the optional parameters are described below:

| Parameter | Description | Default Value | Suggestions |
|---|---|---|---|
| --u | Database username | root | |
| --p | Password of the database user | empty string | |
| --compress | Whether to compress the data for HTTP transmission | false | Remain as default. Compression and decompression respectively increase CPU pressure on the tool and on the Doris BE, so enable this only when the network bandwidth of the source machine is the bottleneck. |
| --timeout | Timeout of the HTTP request sent to Doris (seconds) | 60\*60\*10 | Remain as default |
| --batch | Granularity of batch reading and sending of files (rows) | 4096 | Remain as default |
| --batch_byte | Granularity of batch reading and sending of files (bytes) | 943718400 (900MB) | Remain as default |
| --workers | Concurrency of the load | 0 | 0 means auto mode, in which a value is computed from the size of the data, the disk throughput, and the Stream Load speed. You can also set it manually; a high-performance cluster can use a larger value, preferably not above 10. If you observe excessive load memory usage (via the Memtracker or Exceed logs), dial the worker count down. |
| --disk_throughput | Disk throughput (MB/s) | 800 | Usually remain as default. This value takes part in the automatic inference of --workers; set it according to your actual disk throughput if you want the tool to compute a more appropriate worker count. |
| --streamload_throughput | Actual Stream Load throughput (MB/s) | 100 | Usually remain as default. This value takes part in the automatic inference of --workers. The default is derived from the Stream Load throughput and performance predictability observed in the daily performance testing environment. If you want the tool to compute a more appropriate worker count, set it to the measured Stream Load throughput, i.e. (LoadBytes\*1000)/(LoadTimeMs\*1024\*1024) |
| --max_byte_per_task | Maximum data size per load task; data beyond this size is split into a new load task. | 107374182400 (100G) | A large value is recommended to reduce the number of load versions. However, if you encounter a "body exceed max size" error and don't want to adjust the streaming_load_max_mb parameter (which requires a BE restart), or encounter a "-238 TOO MANY SEGMENT" error, you can temporarily dial this down. |
| --check_utf8 | <p>Whether to check the encoding of the loaded data:</p> <p>1) false: load the raw data directly without checking; 2) true: replace non UTF-8 characters in the data with �</p> | true | Remain as default |
| --debug | Print debug logs | false | Remain as default |
| --auto_retry | The list of failed worker and task numbers for auto retry | empty string | Only used for retrying after a load failure; irrelevant for normal loads. On failure, the concrete value is printed, and you can simply copy and execute it. For example, --auto_retry="1,1;2,1" means the tasks to retry are the first task of the first worker and the first task of the second worker. |
| --auto_retry_times | Number of auto retries | 3 | Remain as default; set this to 0 if you don't want retries |
| --auto_retry_interval | Interval between auto retries | 60 | Remain as default. If Doris fails because of a downtime, it is recommended to set this value based on the actual restart interval of Doris. |
| --log_filename | Location for log storage | "" | Logs are printed to the console by default; to print them to a log file, set a path such as --log_filename="/var/log". |


## Result description

A final result is displayed whether the load succeeds or fails:

| Parameter | Description |
|---|---|
| Status | Whether the load succeeded (Success) or failed (Failed) |
| TotalRows | Total number of rows in the source files |
| FailLoadRows | Number of rows in the source files that were not loaded |
| LoadedRows | Number of rows actually loaded into Doris |
| FilteredRows | Number of rows filtered out by Doris during the load |
| UnselectedRows | Number of rows ignored by Doris during the load |
| LoadBytes | Number of bytes actually loaded |
| LoadTimeMs | Actual load time (ms) |
| LoadFiles | List of files actually loaded |


Examples:

- If the load succeeds, the success message looks like:

  ```json
  Load Result: {
      "Status": "Success",
      "TotalRows": 120,
      "FailLoadRows": 0,
      "LoadedRows": 120,
      "FilteredRows": 0,
      "UnselectedRows": 0,
      "LoadBytes": 40632,
      "LoadTimeMs": 971,
      "LoadFiles": [
          "basic.csv",
          "basic_data1.csv",
          "basic_data2.csv",
          "dir1/basic_data.csv",
          "dir1/basic_data.csv.1",
          "dir1/basic_data1.csv"
      ]
  }
  ```

- If part of the data fails to load, a retry message is given, such as:

  ```text
  load has some error, and auto retry failed, you can retry by :
  ./doris-streamloader --source_file /mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1 --url="http://127.0.0.1:8239" --header="column_separator:|?columns: l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag,l_linestatus, l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment,temp" --db="db" --table="lineitem1" -u root -p "" --compress=false --timeout=36000 --workers=3 --batch=4096 --batch_byt [...]
  ```

Simply copy and run that command (see the `auto_retry` parameter description above). The failure result is also given:

```json
Load Result: {
    "Status": "Failed",
    "TotalRows": 1,
    "FailLoadRows": 1,
    "LoadedRows": 0,
    "FilteredRows": 0,
    "UnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 104,
    "LoadFiles": [
        "/mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1"
    ]
}
```


## Best practice

### 1. Parameter suggestions

1. Required parameters that must be set: ```--source_file=FILE_LIST --url=FE_OR_BE_SERVER_URL_WITH_PORT --header=STREAMLOAD_HEADER --db=TARGET_DATABASE --table=TARGET_TABLE```. **To load multiple files, pass all of them to `source_file` at once.**

2. `workers` defaults to the number of CPU cores. On machines with many cores (for example, 96), this creates too much concurrency, so the value should be reduced. **Setting it to 8 is recommended for most cases.**

3. `max_byte_per_task`: a very large value can be set to reduce the number of load versions. However, if you encounter a `body exceed max size` error and don't want to adjust the `streaming_load_max_mb` parameter (which requires a BE restart), or encounter a `-238 TOO MANY SEGMENT` error, you can temporarily dial this down. **The default is fine for most cases.**

4. The two parameters that affect the number of versions:
- `workers`: the more workers, the higher the concurrency and the more versions; 8 is fine for most cases.
- `max_byte_per_task`: the larger `max_byte_per_task`, the more data in a single version and the fewer versions; but an excessively large value may cause the `-238 TOO MANY SEGMENT` problem. The default is fine for most cases.


### 2. Recommended command

Set the required parameters plus `workers=8`:

```text
./doris-streamloader --source_file="demo.csv,demoFile*.csv,demoDir" --url="http://127.0.0.1:8030" --header="column_separator:," --db="demo" --table="test_load" --u="root" --workers=8
```


### 3. FAQ

- Previously, when some subtasks failed during a load, there was no resumable loading and the fix was to drop the table and reload from scratch. Now the tool retries automatically in this situation; if the retry fails, it prints the retry command so you can rerun it manually.
- The tool's default size for a single load task is 100G, which exceeds the BE's default `streaming_load_max_mb` threshold. If you don't want to restart the BE, you can reduce `max_byte_per_task`.

  To check the current `streaming_load_max_mb`:

  ```bash
  curl "http://127.0.0.1:8040/api/show_config"
  ```

- If you encounter the `-238 TOO MANY SEGMENT` problem during a load, reduce `max_byte_per_task`.
\ No newline at end of file

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org