This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 0a2b6754847 rewrite overview of loading. (#979)
0a2b6754847 is described below

commit 0a2b67548478c5cdbe2e8d3ebce236d165a0dd9c
Author: Yongqiang YANG <98214048+dataroar...@users.noreply.github.com>
AuthorDate: Sat Aug 10 23:33:58 2024 +0800

    rewrite overview of loading. (#979)
---
 docs/data-operate/import/load-manual.md        | 96 ++++++----------------
 .../current/data-operate/import/load-manual.md | 92 +++++----------------
 2 files changed, 45 insertions(+), 143 deletions(-)

diff --git a/docs/data-operate/import/load-manual.md b/docs/data-operate/import/load-manual.md
index ead11af8463..188c4b928df 100644
--- a/docs/data-operate/import/load-manual.md
+++ b/docs/data-operate/import/load-manual.md
@@ -24,17 +24,33 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Introduction to Import Solutions
+Apache Doris offers various methods for importing and integrating data, allowing you to import data from diverse sources into the database. These methods can be categorized into four types:
 
-This section provides an overview of import solutions in order to help users choose the most suitable import solution based on data source, file format, and data volume.
+1. **Real-Time Writing**: Data is written into Doris tables in real time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
+   - For small amounts of data (once every 5 minutes), use [JDBC INSERT](./import-way/insert-into-manual.md).
+   - For higher concurrency or frequency (more than 20 concurrent writes, or multiple writes per minute), enable [Group Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
+   - For high throughput, use [Stream Load](./import-way/stream-load-manual) via HTTP.
 
-Doris supports various import methods, including Stream Load, Broker Load, Insert Into, Routine Load, and MySQL Load. In addition to using Doris's native import methods, Doris also provides a range of ecosystem tools to assist users in data import, including Spark Doris Connector, Flink Doris Connector, Doris Kafka Connector, DataX Doriswriter, and Doris Streamloader.
+2. **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying.
+   - Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink's real-time data streams into Doris.
+   - Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) for Kafka's real-time data streams. Routine Load pulls data from Kafka into Doris and supports CSV and JSON formats, while Kafka Connector pushes data into Doris and supports Avro, JSON, CSV, and Protobuf formats.
+   - Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [DataX](../../ecosystem/datax.md) to write CDC data streams from transactional databases into Doris.
 
-For high-frequency small import scenarios, Doris also provides the Group Commit feature. Group Commit is not a new import method, but an extension to `INSERT INTO VALUES`, `Stream Load`, and `Http Stream` that batches small imports on the server side.
+3. **Batch Import**: Data is batch-loaded from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time import needs.
+   - Use [Broker Load](./import-way/broker-load-manual.md) to write files from S3 and HDFS into Doris.
+   - Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize files from S3, HDFS, and NAS into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+   - Use [Stream Load](./import-way/stream-load-manual) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
 
-Each import method and ecosystem tool has different use cases and supports different data sources and file formats.
+4. **External Data Source Integration**: Query data in external sources (e.g., Hive, JDBC, Iceberg) and import part of it into Doris tables.
+   - Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data from external sources and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize this data into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+   - Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
+
+Each import method in Doris is an implicit transaction by default. For more information on transactions, refer to [Transactions](../transaction.md).
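To make the Group Commit path above concrete, here is a minimal sketch over JDBC or the MySQL client; the `demo.events` table is hypothetical, while `group_commit` is the session variable described in the Group Commit doc:

```sql
-- Sketch: server-side batching of small writes via Group Commit.
-- Assumes a table demo.events(id INT, msg VARCHAR(100)) already exists.
SET group_commit = async_mode;                            -- enable batching for this session
INSERT INTO demo.events VALUES (1, 'click'), (2, 'view'); -- small writes are grouped server-side
```

In async mode the INSERT returns once the data reaches the write-ahead log, and the rows become visible when the grouped batch commits.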
+
+### Quick Overview of Import Methods
+
+Doris's import process mainly involves data sources, data formats, import methods, error handling, data transformation, and transactions. You can quickly browse the scenarios suitable for each import method and the supported file formats in the table below.
 
-### Import Methods
 | Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
 | [Stream Load](./import-way/stream-load-manual) | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
@@ -43,70 +59,4 @@ Each import method and ecosystem tool has different use cases and supports diffe
 | [INSERT INTO SELECT](./import-way/insert-into-manual.md) | <p>Import data between Doris internal tables</p><p>Import external tables</p> | SQL | Depending on memory size | Synchronous |
 | [Routine Load](./import-way/routine-load-manual.md) | Real-time import from Kafka | csv, json | Micro-batch import MB to GB | Asynchronous |
 | [MySQL Load](./import-way/mysql-load-manual.md) | Import from local data | csv | Less than 10GB | Synchronous |
-| [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
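As a rough illustration of the Routine Load row in the table above, a job that continuously pulls JSON from Kafka might look like the following sketch; the database, table, topic, and broker addresses are placeholders:

```sql
-- Sketch: a Routine Load job that keeps pulling JSON messages from Kafka
-- into a hypothetical table demo.events.
CREATE ROUTINE LOAD demo.kafka_events_job ON events
PROPERTIES
(
    "format" = "json"                                  -- Routine Load supports csv and json
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092", -- placeholder brokers
    "kafka_topic" = "events_topic",                    -- placeholder topic
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```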
-
-
-### Ecosystem Tools
-
-| Ecosystem Tool | Use Case |
-| --------------------- | ------------------------------------------------------------ |
-| [Spark Doris Connector](../../ecosystem/spark-doris-connector.md) | Batch import data from Spark |
-| [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) | Real-time import data from Flink |
-| [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) | Real-time import data from Kafka |
-| [DataX Doriswriter](../../ecosystem/datax.md) | Synchronize data from MySQL, Oracle, SQL Server, PostgreSQL, Hive, ADS, etc. |
-| [Doris Streamloader](../../ecosystem/doris-streamloader.md) | Implements concurrent import for Stream Load, allowing multiple files and directories to be imported at once |
-| [X2Doris](./migrate-data-from-other-olap.md) | Migrates data from other AP databases to Doris |
-
-### File Formats
-
-| File Format | Supported Import Methods | Supported Compression Formats |
-| -------- | ------------------------------------ | ----------------------------------------- |
-| csv | Stream Load, Broker Load, MySQL Load | gz, lzo, bz2, lz4, LZ4FRAME, lzop, deflate |
-| json | Stream Load, Broker Load | Not supported |
-| parquet | Stream Load, Broker Load | Not supported |
-| orc | Stream Load, Broker Load | Not supported |
-
-### Data Sources
-
-| Data Source | Supported Import Methods |
-| ---------------------------------------------- | ------------------------------------------------------ |
-| Local data | <p>Stream Load</p> <p>Doris Streamloader</p> <p>MySQL Load</p> |
-| Object storage | <p>Broker Load</p> <p>INSERT INTO SELECT FROM S3 TVF</p> |
-| HDFS | <p>Broker Load</p> <p>INSERT INTO SELECT FROM HDFS TVF</p> |
-| Kafka | <p>Routine Load</p> <p>Doris Kafka Connector</p> |
-| Flink | Flink Doris Connector |
-| Spark | Spark Doris Connector |
-| MySQL, PostgreSQL, Oracle, SQL Server, and other TP databases | <p>Import via external tables</p> <p>Flink Doris Connector</p> |
-| Other AP databases | <p>X2Doris</p> <p>Import via external tables</p> <p>Spark/Flink Doris Connector</p> |
-
-## Concept Introduction
-
-This section introduces some concepts related to import to help users make better use of the data import feature.
-
-### Atomicity
-
-All import tasks in Doris are atomic: an import job either succeeds completely or fails completely, and partially imported data never occurs within one import task. For simple import tasks, users do not need to perform additional configuration or operations. For materialized views associated with a table, atomicity and consistency with the base table are also guaranteed.
-
-For more details, refer to [Transactions](../../data-operate/transaction.md).
-
-### Label Mechanism
-
-Import jobs in Doris can be assigned a label, usually a user-defined string with certain business-logic properties; if not specified, the system generates one automatically. The main purpose of the label is to uniquely identify an import task and ensure that the same label is imported successfully only once.
-
-A label that has already been imported successfully will be rejected when used again, with the error `Label already used`. With this mechanism, Doris achieves `At-Most-Once` semantics on the Doris side; combined with the `At-Least-Once` semantics of the upstream system, this makes `Exactly-Once` semantics for imported data possible.
-
-### Import Mode
-
-Import mode is either synchronous or asynchronous. For synchronous import methods, the returned result indicates whether the import succeeded or failed. For asynchronous import methods, a successful return only indicates that the job was submitted successfully, not that the data was imported successfully; users need to check the running status of the import job with the corresponding command.
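Because Broker Load is asynchronous, the submit-then-poll flow looks roughly like the sketch below; the bucket, credentials, and the `s3.*` property names are placeholders and may vary by Doris version:

```sql
-- Sketch: submit an asynchronous Broker Load job reading a CSV file from S3.
LOAD LABEL demo.broker_load_example
(
    DATA INFILE("s3://my-bucket/path/data.csv")
    INTO TABLE events
    COLUMNS TERMINATED BY ","
)
WITH S3
(
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your_ak>",
    "s3.secret_key" = "<your_sk>"
);

-- A successful return above only means the job was submitted;
-- poll until the job state becomes FINISHED (or CANCELLED on failure).
SHOW LOAD WHERE LABEL = "broker_load_example";
```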
-
-### Data Transformation
-
-When importing data into a table, the content required by the table may not be exactly the same as the content of the source file, so the data needs to be transformed. Doris supports performing certain transformations on the source data during import: mapping, conversion, pre-filtering, and post-filtering.
-
-### Error Data Handling
-
-During import, the data types of the source columns and the target columns may not be fully consistent, and values with inconsistent types are converted during the import. Conversion may fail, for example due to a field type mismatch or an over-long field. Strict mode controls whether rows that fail conversion are filtered out during the import.
-
-### Minimum Write Replica Number
-
-By default, an import succeeds only when a majority of replicas are written successfully. Because this is not flexible and can be inconvenient in certain scenarios, Doris allows users to set the minimum write replica number (Min Load Replica Num): an import task succeeds when the number of replicas written successfully is greater than or equal to the minimum write replica number.
+| [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
\ No newline at end of file
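The INSERT INTO SELECT path described above can be sketched with the S3 table-valued function; the URI, region, and keys are placeholders:

```sql
-- Sketch: batch-load a Parquet file from S3 through the S3 table-valued function.
-- Scheduling this statement with a JOB makes the write asynchronous.
INSERT INTO demo.events
SELECT * FROM S3
(
    "uri" = "s3://my-bucket/path/data.parquet",
    "format" = "parquet",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<your_ak>",
    "s3.secret_key" = "<your_sk>"
);
```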
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
index 3d774c545bf..e0337368e3d 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
@@ -24,17 +24,33 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Introduction to Import Solutions
+Apache Doris provides multiple methods for importing and integrating data, so you can use a suitable import method to bring data from various sources into the database. The import methods provided by Apache Doris fall into four categories:
 
-This section gives an overall introduction to import solutions so that you can choose the most suitable one based on data source, file format, and data volume.
+1. **Real-Time Writing**: Applications write data into Doris tables in real time via HTTP or JDBC, for scenarios that require real-time analysis and querying.
+   * For very small amounts of data (one write every 5 minutes), use [JDBC INSERT](./import-way/insert-into-manual.md).
+   * For higher concurrency or frequency (more than 20 concurrent writes, or multiple writes per minute), enable [Group Commit](./import-way/group-commit-manual.md) and write with JDBC INSERT or Stream Load.
+   * For high throughput, use [Stream Load](./import-way/stream-load-manual) to write data via HTTP.
 
-Doris supports import methods including Stream Load, Broker Load, Insert Into, Routine Load, and MySQL Load. Besides importing with Doris's native methods, Doris also provides a series of ecosystem tools to help users import data, including Spark Doris Connector, Flink Doris Connector, Doris Kafka Connector, DataX Doriswriter, and Doris Streamloader.
+2. **Streaming Synchronization**: Import real-time data streams (e.g., Flink, Kafka, transactional databases) into Doris tables, for scenarios that require real-time analysis and querying.
+   * Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink's real-time data streams into Doris tables.
+   * Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) to write Kafka's real-time data streams into Doris tables. With Routine Load, Doris schedules tasks that pull data from Kafka and write it into Doris, currently supporting csv and json formats. With Kafka Connector, Kafka writes the data into Doris, supporting avro, json, csv, and protobuf formats.
+   * Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [DataX](../../ecosystem/datax.md) to write CDC data streams from transactional databases into Doris.
 
-For high-frequency small import scenarios, Doris also provides the Group Commit feature. Group Commit is not a new import method but an extension of `INSERT INTO VALUES`, `Stream Load`, and `Http Stream` that batches small imports on the server side.
+3. **Batch Import**: Batch-load data from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, for non-real-time import needs.
+   * Use [Broker Load](./import-way/broker-load-manual.md) to write files in S3 and HDFS into Doris.
+   * Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize files in S3, HDFS, and NAS into Doris, and combine it with [JOB](../scheduler/job-scheduler.md) for asynchronous writes.
+   * Use [Stream Load](./import-way/stream-load-manual) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
 
-Each import method and ecosystem tool suits different scenarios and differs in the data sources and file formats it supports.
+4. **External Data Source Integration**: Through integration with external data sources (e.g., Hive, JDBC, Iceberg), query external data and import part of it into Doris tables.
+   * Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data in external sources, and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize the external data into Doris; combine it with [JOB](../scheduler/job-scheduler.md) for asynchronous writes.
+   * Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
+
+Each Doris import is an implicit transaction by default. For more information about transactions, see [Transactions](../transaction.md).
+
+## Quick Overview of Import Methods
+
+Doris imports mainly involve data sources, data formats, import methods, error-data handling, data transformation, and transactions. You can quickly browse the suitable scenarios and supported file formats of each import method in the table below.
 
-### Import Methods
 | Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
 | [Stream Load](./import-way/stream-load-manual) | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
@@ -45,67 +61,3 @@ Doris supports import methods including Stream Load, Broker Load, Insert Into, Routin
 | [MySQL Load](./import-way/mysql-load-manual.md) | Import from local data | csv | Less than 10GB | Synchronous |
 | [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
 
-### Ecosystem Tools
-
-| Ecosystem Tool | Use Case |
-| --------------------- | ------------------------------------------------------------ |
-| [Spark Doris Connector](../../ecosystem/spark-doris-connector.md) | Batch import data from Spark |
-| [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) | Real-time import data from Flink |
-| [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) | Real-time import data from Kafka |
-| [DataX Doriswriter](../../ecosystem/datax.md) | Synchronize data from MySQL, Oracle, SQL Server, PostgreSQL, Hive, ADS, etc. |
-| [Doris Streamloader](../../ecosystem/doris-streamloader.md) | Implements concurrent import for Stream Load; one run can import multiple files and directories at once |
-| [X2Doris](./migrate-data-from-other-olap.md) | Migrates data from other AP databases to Doris |
-
-### File Formats
-
-| File Format | Supported Import Methods | Supported Compression Formats |
-| -------- | ------------------------------------ | ----------------------------------------- |
-| csv | Stream Load, Broker Load, MySQL Load | gz, lzo, bz2, lz4, LZ4FRAME, lzop, deflate |
-| json | Stream Load, Broker Load | Not supported |
-| parquet | Stream Load, Broker Load | Not supported |
-| orc | Stream Load, Broker Load | Not supported |
-
-### Data Sources
-
-| Data Source | Supported Import Methods |
-| ---------------------------------------------- | ------------------------------------------------------ |
-| Local data | <p>Stream Load</p> <p>Doris Streamloader</p> <p>MySQL Load</p> |
-| Object storage | <p>Broker Load</p> <p>INSERT INTO SELECT FROM S3 TVF</p> |
-| HDFS | <p>Broker Load</p> <p>INSERT INTO SELECT FROM HDFS TVF</p> |
-| Kafka | <p>Routine Load</p> <p>Doris Kafka Connector</p> |
-| Flink | Flink Doris Connector |
-| Spark | Spark Doris Connector |
-| MySQL, PostgreSQL, Oracle, SQL Server, and other TP databases | <p>Import via external tables</p> <p>Flink Doris Connector</p> |
-| Other AP databases | <p>X2Doris</p> <p>Import via external tables</p> <p>Spark/Flink Doris Connector</p> |
-
-## Concept Introduction
-
-This section introduces some concepts related to import to help you make better use of the data import feature.
-
-### Atomicity
-
-All import tasks in Doris are atomic: an import job either succeeds completely or fails completely, partially imported data never occurs within one import task, and imports into multiple tables within the same task are also atomic. For simple import tasks, no extra configuration or operation is needed. For materialized views attached to a table, atomicity and consistency with the base table are also guaranteed.
-
-For more details, refer to [Transactions](../../data-operate/transaction.md).
-
-### Label Mechanism
-
-Doris import jobs can be assigned a Label, usually a user-defined string with certain business-logic properties; if you do not specify one, the system generates one automatically. The main purpose of a Label is to uniquely identify an import task and guarantee that the same Label is imported successfully at most once.
-
-A Label that has been imported successfully will be rejected when used again, with the error `Label already used`. Through this mechanism, Doris achieves `At-Most-Once` semantics on its side; combined with the `At-Least-Once` semantics of an upstream system, `Exactly-Once` semantics for imported data can be achieved.
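A small sketch of the label guarantee described above; the table and label names are hypothetical:

```sql
-- First execution succeeds and consumes the label.
INSERT INTO demo.events WITH LABEL retry_20240810_001 VALUES (1, 'click');
-- Re-running the exact same statement is rejected with "Label already used",
-- so an upstream retry cannot insert the same batch twice.
```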
-
-### Import Mode
-
-Import modes are synchronous and asynchronous. For synchronous imports, the returned result indicates whether the import succeeded or failed. For asynchronous imports, a successful return only means the job was submitted successfully, not that the data was imported successfully; you need to check the running status of the import job with the corresponding command.
-
-### Data Transformation
-
-When importing data into a table, the table's content may not be exactly identical to the source data, so the data needs to be transformed. Doris supports transforming the source data directly during import: mapping, conversion, pre-filtering, and post-filtering.
-
-### Error Data Handling
-
-During import, the data types of the source columns may not fully match the target columns, and mismatched source values are converted during the import. Conversion may fail, for example due to a field type mismatch or an over-long field. Strict mode controls whether such rows that fail conversion are filtered out during the import.
-
-### Minimum Write Replica Number
-
-By default, an import succeeds only when more than half of the replicas are written successfully. Because this is not flexible and can be inconvenient in some scenarios, Doris allows users to set a minimum write replica number (Min Load Replica Num): an import task succeeds when the number of successfully written replicas is greater than or equal to this value.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org