This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 0a2b6754847 rewrite overview of loading. (#979)
0a2b6754847 is described below

commit 0a2b67548478c5cdbe2e8d3ebce236d165a0dd9c
Author: Yongqiang YANG <98214048+dataroar...@users.noreply.github.com>
AuthorDate: Sat Aug 10 23:33:58 2024 +0800

    rewrite overview of loading. (#979)
---
 docs/data-operate/import/load-manual.md        | 96 ++++++----------------
 .../current/data-operate/import/load-manual.md | 92 +++++----------------
 2 files changed, 45 insertions(+), 143 deletions(-)

diff --git a/docs/data-operate/import/load-manual.md b/docs/data-operate/import/load-manual.md
index ead11af8463..188c4b928df 100644
--- a/docs/data-operate/import/load-manual.md
+++ b/docs/data-operate/import/load-manual.md
@@ -24,17 +24,33 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Introduction to Import Solutions
+Apache Doris offers various methods for importing and integrating data, allowing you to import data from diverse sources into the database. These methods can be categorized into four types:
 
-This section provides an overview of import solutions in order to help users choose the most suitable import solution based on data source, file format, and data volume.
+1. **Real-Time Writing**: Data is written into Doris tables in real time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
+   - For small amounts of data (once every 5 minutes), use [JDBC INSERT](./import-way/insert-into-manual.md).
+   - For higher concurrency or frequency (more than 20 concurrent writes, or multiple writes per minute), enable [Group Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
+   - For high throughput, use [Stream Load](./import-way/stream-load-manual) via HTTP.
 
-Doris supports various import methods, including Stream Load, Broker Load, Insert Into, Routine Load, and MySQL Load. In addition to using Doris's native import methods, Doris also provides a range of ecosystem tools to assist users in data import, including Spark Doris Connector, Flink Doris Connector, Doris Kafka Connector, DataX Doriswriter, and Doris Streamloader.
+2. **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying.
+   - Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink's real-time data streams into Doris.
+   - Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) for Kafka's real-time data streams. Routine Load pulls data from Kafka into Doris and supports CSV and JSON formats, while Kafka Connector pushes data into Doris and supports Avro, JSON, CSV, and Protobuf formats.
+   - Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [DataX](../../ecosystem/datax.md) to write CDC data streams from transactional databases into Doris.
 
-For high-frequency small import scenarios, Doris also provides the Group Commit feature. Group Commit is not a new import method, but an extension to `INSERT INTO VALUES`, `Stream Load`, and `Http Stream` that batches small imports on the server side.
+3. **Batch Import**: Data is batch-loaded from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time import needs.
+   - Use [Broker Load](./import-way/broker-load-manual.md) to write files from S3 and HDFS into Doris.
+   - Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize files from S3, HDFS, and NAS into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+   - Use [Stream Load](./import-way/stream-load-manual) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
 
-Each import method and ecosystem tool has different use cases and supports different data sources and file formats.
+4. **External Data Source Integration**: Query data in external sources (e.g., Hive, JDBC, Iceberg) and import part of it into Doris tables.
+   - Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data from external sources and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize this data into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+   - Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
+
+Each import method in Doris is an implicit transaction by default. For more information on transactions, refer to [Transactions](../transaction.md).
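To make the Group Commit path above concrete, here is a minimal sketch over JDBC or the MySQL client; the `demo.events` table is hypothetical, while `group_commit` is the session variable described in the Group Commit doc:

```sql
-- Sketch: server-side batching of small writes via Group Commit.
-- Assumes a table demo.events(id INT, msg VARCHAR(100)) already exists.
SET group_commit = async_mode;                            -- enable batching for this session
INSERT INTO demo.events VALUES (1, 'click'), (2, 'view'); -- small writes are grouped server-side
```

In async mode the INSERT returns once the data reaches the write-ahead log, and the rows become visible when the grouped batch commits.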
+
+### Quick Overview of Import Methods
+
+Doris's import process mainly involves data sources, data formats, import methods, error handling, data transformation, and transactions. You can quickly browse the scenarios suitable for each import method and the supported file formats in the table below.
 
-### Import Methods
 | Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
 | [Stream Load](./import-way/stream-load-manual) | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
@@ -43,70 +59,4 @@ Each import method and ecosystem tool has different use cases and supports diffe
 | [INSERT INTO SELECT](./import-way/insert-into-manual.md) | <p>Import data between Doris internal tables</p><p>Import external tables</p> | SQL | Depending on memory size | Synchronous |
 | [Routine Load](./import-way/routine-load-manual.md) | Real-time import from Kafka | csv, json | Micro-batch import MB to GB | Asynchronous |
 | [MySQL Load](./import-way/mysql-load-manual.md) | Import from local data | csv | Less than 10GB | Synchronous |
-| [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
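As a rough illustration of the Routine Load row in the table above, a job that continuously pulls JSON from Kafka might look like the following sketch; the database, table, topic, and broker addresses are placeholders:

```sql
-- Sketch: a Routine Load job that keeps pulling JSON messages from Kafka
-- into a hypothetical table demo.events.
CREATE ROUTINE LOAD demo.kafka_events_job ON events
PROPERTIES
(
    "format" = "json"                                  -- Routine Load supports csv and json
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092", -- placeholder brokers
    "kafka_topic" = "events_topic",                    -- placeholder topic
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```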
-
-
-### Ecosystem Tools
-
-| Ecosystem Tool | Use Case |
-| --------------------- | ------------------------------------------------------------ |
-| [Spark Doris Connector](../../ecosystem/spark-doris-connector.md) | Batch import data from Spark |
-| [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) | Real-time import data from Flink |
-| [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) | Real-time import data from Kafka |
-| [DataX Doriswriter](../../ecosystem/datax.md) | Synchronize data from MySQL, Oracle, SQL Server, PostgreSQL, Hive, ADS, etc. |
-| [Doris Streamloader](../../ecosystem/doris-streamloader.md) | Implements concurrent import for Stream Load, allowing multiple files and directories to be imported at once |
-| [X2Doris](./migrate-data-from-other-olap.md) | Migrates data from other AP databases to Doris |
-
-### File Formats
-
-| File Format | Supported Import Methods | Supported Compression Formats |
-| -------- | ------------------------------------ | ----------------------------------------- |
-| csv | Stream Load, Broker Load, MySQL Load | gz, lzo, bz2, lz4, LZ4FRAME, lzop, deflate |
-| json | Stream Load, Broker Load | Not supported |
-| parquet | Stream Load, Broker Load | Not supported |
-| orc | Stream Load, Broker Load | Not supported |
-
-### Data Sources
-
-| Data Source | Supported Import Methods |
-| ---------------------------------------------- | ------------------------------------------------------ |
-| Local data | <p>Stream Load</p> <p>Doris Streamloader</p> <p>MySQL Load</p> |
-| Object storage | <p>Broker Load</p> <p>INSERT INTO SELECT FROM S3 TVF</p> |
-| HDFS | <p>Broker Load</p> <p>INSERT INTO SELECT FROM HDFS TVF</p> |
-| Kafka | <p>Routine Load</p> <p>Doris Kafka Connector</p> |
-| Flink | Flink Doris Connector |
-| Spark | Spark Doris Connector |
-| MySQL, PostgreSQL, Oracle, SQL Server, and other TP databases | <p>Import via external tables</p> <p>Flink Doris Connector</p> |
-| Other AP databases | <p>X2Doris</p> <p>Import via external tables</p> <p>Spark/Flink Doris Connector</p> |
-
-## Concept Introduction
-
-This section introduces some concepts related to import to help users make better use of the data import feature.
-
-### Atomicity
-
-All import tasks in Doris are atomic: an import job either succeeds completely or fails completely, and partially imported data never occurs within one import task. For simple import tasks, users do not need to perform additional configuration or operations. For materialized views associated with a table, atomicity and consistency with the base table are also guaranteed.
-
-For more details, refer to [Transactions](../../data-operate/transaction.md).
-
-### Label Mechanism
-
-Import jobs in Doris can be assigned a label, usually a user-defined string with certain business-logic properties; if not specified, the system generates one automatically. The main purpose of the label is to uniquely identify an import task and ensure that the same label is imported successfully only once.
-
-A label that has already been imported successfully will be rejected when used again, with the error `Label already used`. With this mechanism, Doris achieves `At-Most-Once` semantics on the Doris side; combined with the `At-Least-Once` semantics of the upstream system, this makes `Exactly-Once` semantics for imported data possible.
-
-### Import Mode
-
-Import mode is either synchronous or asynchronous. For synchronous import methods, the returned result indicates whether the import succeeded or failed. For asynchronous import methods, a successful return only indicates that the job was submitted successfully, not that the data was imported successfully; users need to check the running status of the import job with the corresponding command.
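Because Broker Load is asynchronous, the submit-then-poll flow looks roughly like the sketch below; the bucket, credentials, and the `s3.*` property names are placeholders and may vary by Doris version:

```sql
-- Sketch: submit an asynchronous Broker Load job reading a CSV file from S3.
LOAD LABEL demo.broker_load_example
(
    DATA INFILE("s3://my-bucket/path/data.csv")
    INTO TABLE events
    COLUMNS TERMINATED BY ","
)
WITH S3
(
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your_ak>",
    "s3.secret_key" = "<your_sk>"
);

-- A successful return above only means the job was submitted;
-- poll until the job state becomes FINISHED (or CANCELLED on failure).
SHOW LOAD WHERE LABEL = "broker_load_example";
```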
-
-### Data Transformation
-
-When importing data into a table, the content required by the table may not be exactly the same as the content of the source file, so the data needs to be transformed. Doris supports performing certain transformations on the source data during import: mapping, conversion, pre-filtering, and post-filtering.
-
-### Error Data Handling
-
-During import, the data types of the source columns and the target columns may not be fully consistent, and values with inconsistent types are converted during the import. Conversion may fail, for example due to a field type mismatch or an over-long field. Strict mode controls whether rows that fail conversion are filtered out during the import.
-
-### Minimum Write Replica Number
-
-By default, an import succeeds only when a majority of replicas are written successfully. Because this is not flexible and can be inconvenient in certain scenarios, Doris allows users to set the minimum write replica number (Min Load Replica Num): an import task succeeds when the number of replicas written successfully is greater than or equal to the minimum write replica number.
+| [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
\ No newline at end of file
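The INSERT INTO SELECT path described above can be sketched with the S3 table-valued function; the URI, region, and keys are placeholders:

```sql
-- Sketch: batch-load a Parquet file from S3 through the S3 table-valued function.
-- Scheduling this statement with a JOB makes the write asynchronous.
INSERT INTO demo.events
SELECT * FROM S3
(
    "uri" = "s3://my-bucket/path/data.parquet",
    "format" = "parquet",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your_ak>",
    "s3.secret_key" = "<your_sk>"
);
```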
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
index 3d774c545bf..e0337368e3d 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
@@ -24,17 +24,33 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Introduction to Import Solutions
+Apache Doris provides multiple methods for importing and integrating data, so you can use a suitable import method to bring data from various sources into the database. The import methods provided by Apache Doris fall into four categories:
 
-This section gives an overall introduction to import solutions so that you can choose the most suitable one based on data source, file format, and data volume.
+1. **Real-Time Writing**: Applications write data into Doris tables in real time via HTTP or JDBC, for scenarios that require real-time analysis and querying.
+   * For very small amounts of data (one write every 5 minutes), use [JDBC INSERT](./import-way/insert-into-manual.md).
+   * For higher concurrency or frequency (more than 20 concurrent writes, or multiple writes per minute), enable [Group Commit](./import-way/group-commit-manual.md) and write with JDBC INSERT or Stream Load.
+   * For high throughput, use [Stream Load](./import-way/stream-load-manual) to write data via HTTP.
 
-Doris supports import methods including Stream Load, Broker Load, Insert Into, Routine Load, and MySQL Load. Besides importing with Doris's native methods, Doris also provides a series of ecosystem tools to help users import data, including Spark Doris Connector, Flink Doris Connector, Doris Kafka Connector, DataX Doriswriter, and Doris Streamloader.
+2. **Streaming Synchronization**: Import real-time data streams (e.g., Flink, Kafka, transactional databases) into Doris tables, for scenarios that require real-time analysis and querying.
+   * Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink's real-time data streams into Doris tables.
+   * Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) to write Kafka's real-time data streams into Doris tables. With Routine Load, Doris schedules tasks that pull data from Kafka and write it into Doris, currently supporting csv and json formats. With Kafka Connector, Kafka writes the data into Doris, supporting avro, json, csv, and protobuf formats.
+   * Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [DataX](../../ecosystem/datax.md) to write CDC data streams from transactional databases into Doris.
 
-For high-frequency small import scenarios, Doris also provides the Group Commit feature. Group Commit is not a new import method but an extension of `INSERT INTO VALUES`, `Stream Load`, and `Http Stream` that batches small imports on the server side.
+3. **Batch Import**: Batch-load data from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, for non-real-time import needs.
+   * Use [Broker Load](./import-way/broker-load-manual.md) to write files in S3 and HDFS into Doris.
+   * Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize files in S3, HDFS, and NAS into Doris, and combine it with [JOB](../scheduler/job-scheduler.md) for asynchronous writes.
+   * Use [Stream Load](./import-way/stream-load-manual) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
 
-Each import method and ecosystem tool suits different scenarios and differs in the data sources and file formats it supports.
+4. **External Data Source Integration**: Through integration with external data sources (e.g., Hive, JDBC, Iceberg), query external data and import part of it into Doris tables.
+   * Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data in external sources, and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize the external data into Doris; combine it with [JOB](../scheduler/job-scheduler.md) for asynchronous writes.
+   * Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
+
+Each Doris import is an implicit transaction by default. For more information about transactions, see [Transactions](../transaction.md).
+
+## Quick Overview of Import Methods
+
+Doris imports mainly involve data sources, data formats, import methods, error-data handling, data transformation, and transactions. You can quickly browse the suitable scenarios and supported file formats of each import method in the table below.
 
-### Import Methods
 | Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
 | [Stream Load](./import-way/stream-load-manual) | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
@@ -45,67 +61,3 @@ Doris supports import methods including Stream Load, Broker Load, Insert Into, Routin
 | [MySQL Load](./import-way/mysql-load-manual.md) | Import from local data | csv | Less than 10GB | Synchronous |
 | [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
 
-### Ecosystem Tools
-
-| Ecosystem Tool | Use Case |
-| --------------------- | ------------------------------------------------------------ |
-| [Spark Doris Connector](../../ecosystem/spark-doris-connector.md) | Batch import data from Spark |
-| [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) | Real-time import data from Flink |
-| [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) | Real-time import data from Kafka |
-| [DataX Doriswriter](../../ecosystem/datax.md) | Synchronize data from MySQL, Oracle, SQL Server, PostgreSQL, Hive, ADS, etc. |
-| [Doris Streamloader](../../ecosystem/doris-streamloader.md) | Implements concurrent import for Stream Load; one run can import multiple files and directories at once |
-| [X2Doris](./migrate-data-from-other-olap.md) | Migrates data from other AP databases to Doris |
-
-### File Formats
-
-| File Format | Supported Import Methods | Supported Compression Formats |
-| -------- | ------------------------------------ | ----------------------------------------- |
-| csv | Stream Load, Broker Load, MySQL Load | gz, lzo, bz2, lz4, LZ4FRAME, lzop, deflate |
-| json | Stream Load, Broker Load | Not supported |
-| parquet | Stream Load, Broker Load | Not supported |
-| orc | Stream Load, Broker Load | Not supported |
-
-### Data Sources
-
-| Data Source | Supported Import Methods |
-| ---------------------------------------------- | ------------------------------------------------------ |
-| Local data | <p>Stream Load</p> <p>Doris Streamloader</p> <p>MySQL Load</p> |
-| Object storage | <p>Broker Load</p> <p>INSERT INTO SELECT FROM S3 TVF</p> |
-| HDFS | <p>Broker Load</p> <p>INSERT INTO SELECT FROM HDFS TVF</p> |
-| Kafka | <p>Routine Load</p> <p>Doris Kafka Connector</p> |
-| Flink | Flink Doris Connector |
-| Spark | Spark Doris Connector |
-| MySQL, PostgreSQL, Oracle, SQL Server, and other TP databases | <p>Import via external tables</p> <p>Flink Doris Connector</p> |
-| Other AP databases | <p>X2Doris</p> <p>Import via external tables</p> <p>Spark/Flink Doris Connector</p> |
-
-## Concept Introduction
-
-This section introduces some concepts related to import to help you make better use of the data import feature.
-
-### Atomicity
-
-All import tasks in Doris are atomic: an import job either succeeds completely or fails completely, partially imported data never occurs within one import task, and imports into multiple tables within the same task are also atomic. For simple import tasks, no extra configuration or operation is needed. For materialized views attached to a table, atomicity and consistency with the base table are also guaranteed.
-
-For more details, refer to [Transactions](../../data-operate/transaction.md).
-
-### Label Mechanism
-
-Doris import jobs can be assigned a Label, usually a user-defined string with certain business-logic properties; if you do not specify one, the system generates one automatically. The main purpose of a Label is to uniquely identify an import task and guarantee that the same Label is imported successfully at most once.
-
-A Label that has been imported successfully will be rejected when used again, with the error `Label already used`. Through this mechanism, Doris achieves `At-Most-Once` semantics on its side; combined with the `At-Least-Once` semantics of an upstream system, `Exactly-Once` semantics for imported data can be achieved.
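A small sketch of the label guarantee described above; the table and label names are hypothetical:

```sql
-- First execution succeeds and consumes the label.
INSERT INTO demo.events WITH LABEL retry_20240810_001 VALUES (1, 'click');
-- Re-running the exact same statement is rejected with "Label already used",
-- so an upstream retry cannot insert the same batch twice.
```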
-
-### Import Mode
-
-Import modes are synchronous and asynchronous. For synchronous imports, the returned result indicates whether the import succeeded or failed. For asynchronous imports, a successful return only means the job was submitted successfully, not that the data was imported successfully; you need to check the running status of the import job with the corresponding command.
-
-### Data Transformation
-
-When importing data into a table, the table's content may not be exactly identical to the source data, so the data needs to be transformed. Doris supports transforming the source data directly during import: mapping, conversion, pre-filtering, and post-filtering.
-
-### Error Data Handling
-
-During import, the data types of the source columns may not fully match the target columns, and mismatched source values are converted during the import. Conversion may fail, for example due to a field type mismatch or an over-long field. Strict mode controls whether such rows that fail conversion are filtered out during the import.
-
-### Minimum Write Replica Number
-
-By default, an import succeeds only when more than half of the replicas are written successfully. Because this is not flexible and can be inconvenient in some scenarios, Doris allows users to set a minimum write replica number (Min Load Replica Num): an import task succeeds when the number of successfully written replicas is greater than or equal to this value.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org