This is an automated email from the ASF dual-hosted git repository.

dataroaring pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new dd2f78db47 fix load overview (#1012)
dd2f78db47 is described below

commit dd2f78db4741e72d72c4a9f9cc6902c6f56f70f3
Author: Yongqiang YANG <98214048+dataroar...@users.noreply.github.com>
AuthorDate: Tue Aug 20 15:08:54 2024 +0800

    fix load overview (#1012)
---
 .../import/import-way/insert-into-manual.md        |  6 ++-
 docs/data-operate/import/load-manual.md            | 53 +++++++++++++---------
 .../import/import-way/insert-into-manual.md        |  7 ++-
 .../current/data-operate/import/load-manual.md     | 45 ++++++++++--------
 4 files changed, 70 insertions(+), 41 deletions(-)

diff --git a/docs/data-operate/import/import-way/insert-into-manual.md b/docs/data-operate/import/import-way/insert-into-manual.md
index d9d7984e5b..70580e3cef 100644
--- a/docs/data-operate/import/import-way/insert-into-manual.md
+++ b/docs/data-operate/import/import-way/insert-into-manual.md
@@ -79,7 +79,7 @@ VALUES (1, "Emily", 25), (5, "Ava", 17);
 ```
 
-INSERT INTO is a synchronous import method, where the import result is directly returned to the user.
+INSERT INTO is a synchronous import method, where the import result is directly returned to the user. You can enable [group commit](../import-way/group-commit-manual.md) to achieve higher performance.
 
 ```JSON
 Query OK, 5 rows affected (0.308 sec)
@@ -127,6 +127,10 @@ MySQL> SELECT COUNT(*) FROM testdb.test_table2;
 1 row in set (0.071 sec)
 ```
 
+4. You can use a [JOB](../../scheduler/job-scheduler.md) to make the INSERT operation execute asynchronously.
+
+5. Sources can be a [TVF](../../../lakehouse/file.md) or tables in a [catalog](../../../lakehouse/database).
+
 ### View INSERT INTO jobs
 
 You can use the `SHOW LOAD` command to view the completed INSERT INTO tasks.
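The group commit reference added in the hunk above can be made concrete. A minimal sketch, assuming the `testdb.test_table` schema from the surrounding examples and the `group_commit` session variable available in recent Doris versions:

```sql
-- Enable asynchronous group commit for this session; small INSERTs are then
-- batched server-side and the statement returns before the data is visible.
SET group_commit = async_mode;

INSERT INTO testdb.test_table VALUES (6, "Liam", 31), (7, "Noah", 22);
```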
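Likewise, the new note on running INSERT asynchronously can be sketched with the job scheduler; the job name, schedule, and tables are illustrative:

```sql
-- Schedule the INSERT as a one-shot job so the client does not block on it;
-- ON SCHEDULE EVERY <interval> would make it recurring instead.
CREATE JOB insert_once
ON SCHEDULE AT '2024-08-21 00:00:00'
DO
INSERT INTO testdb.test_table2 SELECT * FROM testdb.test_table;
```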
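And for the note that sources can be a TVF or catalog tables, a sketch using the S3 table-valued function; the bucket, file path, and credentials are placeholders:

```sql
-- INSERT INTO SELECT over a TVF: the remote CSV file is scanned like a table
-- and the result lands in a Doris table.
INSERT INTO testdb.test_table
SELECT * FROM S3(
    "uri" = "s3://your-bucket/path/data.csv",
    "format" = "csv",
    "s3.access_key" = "<ak>",
    "s3.secret_key" = "<sk>",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1"
);
```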
diff --git a/docs/data-operate/import/load-manual.md b/docs/data-operate/import/load-manual.md
index 188c4b928d..6dcf57e666 100644
--- a/docs/data-operate/import/load-manual.md
+++ b/docs/data-operate/import/load-manual.md
@@ -26,24 +26,35 @@ under the License.
 
 Apache Doris offers various methods for importing and integrating data, allowing you to import data from diverse sources into the database. These methods can be categorized into four types:
 
-1. **Real-Time Writing**: Data is written into Doris tables in real-time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
-   - For small amounts of data (once every 5 minutes), use [JDBC INSERT](./import-way/insert-into-manual.md).
-   - For higher concurrency or frequency (more than 20 concurrent writes or multiple writes per minute), enable [Group Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
-   - For high throughput, use [Stream Load](./import-way/stream-load-manua) via HTTP.
+- **Real-Time Writing**: Data is written into Doris tables in real-time via HTTP or JDBC, suitable for scenarios requiring immediate analysis and querying.
+
-2. **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying.
-   - Use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink’s real-time data streams into Doris.
-   - Use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) for Kafka’s real-time data streams. Routine Load pulls data from Kafka to Doris and supports CSV and JSON formats, while Kafka Connector writes data to Doris, supporting Avro, JSON, CSV, and Protobuf formats.
-   - Use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [Datax](../../ecosystem/datax.md) to write transactional database CDC data streams into Doris.
+  - For small amounts of data (once every 5 minutes), you can use [JDBC INSERT](./import-way/insert-into-manual.md).
+
-3. **Batch Import**: Data is batch-loaded from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data import needs.
-   - Use [Broker Load](./import-way/broker-load-manual.md) to write files from S3 and HDFS into Doris.
-   - Use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize files from S3, HDFS, and NAS into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
-   - Use [Stream Load](./import-way/stream-load-manua) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
+  - For higher concurrency or frequency (more than 20 concurrent writes or multiple writes per minute), you can enable [Group Commit](./import-way/group-commit-manual.md) and use JDBC INSERT or Stream Load.
+
-4. **External Data Source Integration**: Query and partially import data from external sources (e.g., Hive, JDBC, Iceberg) into Doris tables.
-   - Create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data from external sources and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize this data into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
-   - Use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
+  - For high throughput, you can use [Stream Load](./import-way/stream-load-manual) via HTTP.
+
+- **Streaming Synchronization**: Real-time data streams (e.g., Flink, Kafka, transactional databases) are imported into Doris tables, ideal for real-time analysis and querying.
+
+  - You can use [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) to write Flink’s real-time data streams into Doris.
+
+  - You can use [Routine Load](./import-way/routine-load-manual.md) or [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) for Kafka’s real-time data streams. Routine Load pulls data from Kafka to Doris and supports CSV and JSON formats, while Kafka Connector writes data to Doris, supporting Avro, JSON, CSV, and Protobuf formats.
+
+  - You can use [Flink CDC](../../ecosystem/flink-doris-connector.md) or [DataX](../../ecosystem/datax.md) to write transactional database CDC data streams into Doris.
+
+- **Batch Import**: Data is batch-loaded from external storage systems (e.g., S3, HDFS, local files, NAS) into Doris tables, suitable for non-real-time data import needs.
+
+  - You can use [Broker Load](./import-way/broker-load-manual.md) to write files from S3 and HDFS into Doris.
+
+  - You can use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronously load files from S3, HDFS, and NAS into Doris, and you can perform the operation asynchronously using a [JOB](../scheduler/job-scheduler.md).
+
+  - You can use [Stream Load](./import-way/stream-load-manual) or [Doris Streamloader](../../ecosystem/doris-streamloader.md) to write local files into Doris.
+
+- **External Data Source Integration**: Query and partially import data from external sources (e.g., Hive, JDBC, Iceberg) into Doris tables.
+
+  - You can create a [Catalog](../../lakehouse/lakehouse-overview.md) to read data from external sources and use [INSERT INTO SELECT](./import-way/insert-into-manual.md) to synchronize this data into Doris, with asynchronous writing via [JOB](../scheduler/job-scheduler.md).
+
+  - You can use [X2Doris](./migrate-data-from-other-olap.md) to migrate data from other AP systems into Doris.
 
 Each import method in Doris is an implicit transaction by default. For more information on transactions, refer to [Transactions](../transaction.md).
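The Routine Load path described in the hunk above can be sketched as follows; the broker address, topic, and job name are placeholders:

```sql
-- Continuously pull CSV records from a Kafka topic into a Doris table.
-- Doris schedules the consumption tasks; the job runs until paused or stopped.
CREATE ROUTINE LOAD testdb.example_kafka_load ON test_table
COLUMNS TERMINATED BY ","
PROPERTIES (
    "max_batch_interval" = "20"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "events",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```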
@@ -53,10 +64,10 @@ Doris's import process mainly involves various aspects such as data sources, dat
 
 | Import Method | Use Case | Supported File Formats | Single Import Volume | Import Mode |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
-| [Stream Load](./import-way/stream-load-manual) | Import from local data | csv, json, parquet, orc | Less than 10GB | Synchronous |
-| [Broker Load](./import-way/broker-load-manual.md) | Import from object storage, HDFS, etc. | csv, json, parquet, orc | Tens of GB to hundreds of GB | Asynchronous |
-| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | <p>Import single or small batch data</p><p>Import via JDBC, etc.</p> | SQL | Simple testing | Synchronous |
-| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | <p>Import data between Doris internal tables</p><p>Import external tables</p> | SQL | Depending on memory size | Synchronous |
+| [Stream Load](./import-way/stream-load-manual) | Importing local files or pushing data from applications via HTTP. | csv, json, parquet, orc | Less than 10GB | Synchronous |
+| [Broker Load](./import-way/broker-load-manual.md) | Importing from object storage, HDFS, etc. | csv, json, parquet, orc | Tens of GB to hundreds of GB | Asynchronous |
+| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | Writing data via JDBC. | SQL | Simple testing | Synchronous |
+| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | Importing from an external source like a table in a catalog or files in S3. | SQL | Depending on memory size | Synchronous; asynchronous via JOB |
 | [Routine Load](./import-way/routine-load-manual.md) | Real-time import from Kafka | csv, json | Micro-batch import MB to GB | Asynchronous |
-| [MySQL Load](./import-way/mysql-load-manual.md) | Import from local data | csv | Less than 10GB | Synchronous |
-| [Group Commit](./import-way/group-commit-manual.md) | High-frequency small batch import | Depending on the import method used | Micro-batch import KB | - |
\ No newline at end of file
+| [MySQL Load](./import-way/mysql-load-manual.md) | Importing from local files. | csv | Less than 1GB | Synchronous |
+| [Group Commit](./import-way/group-commit-manual.md) | Writing with high frequency. | Depending on the import method used | Micro-batch import KB | - |
\ No newline at end of file
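For the asynchronous Broker Load row in the table above, a minimal sketch; the bucket, credentials, and label are placeholders, and the exact `WITH S3` property names vary across Doris versions:

```sql
-- Asynchronous bulk load from object storage; check progress with SHOW LOAD.
LOAD LABEL testdb.example_broker_load
(
    DATA INFILE("s3://your-bucket/path/*.csv")
    INTO TABLE test_table
    COLUMNS TERMINATED BY ","
)
WITH S3
(
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<ak>",
    "s3.secret_key" = "<sk>"
)
PROPERTIES
(
    "timeout" = "3600"
);
```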
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
index 198c0be954..0bdf6b570b 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/import-way/insert-into-manual.md
@@ -84,7 +84,7 @@ VALUES (1, "Emily", 25), (5, "Ava", 17);
 ```
 
-INSERT INTO 是一种同步导入方式,导入结果会直接返回给用户。
+INSERT INTO 是一种同步导入方式,导入结果会直接返回给用户。可以打开 [group commit](../import-way/group-commit-manual.md) 达到更高的性能。
 
 ```JSON
 Query OK, 5 rows affected (0.308 sec)
@@ -132,6 +132,10 @@ MySQL> SELECT COUNT(*) FROM testdb.test_table2;
 1 row in set (0.071 sec)
 ```
 
+4. 可以使用 [JOB](../../scheduler/job-scheduler.md) 异步执行 INSERT。
+
+5. 数据源可以是 [tvf](../../../lakehouse/file.md) 或者 [catalog](../../../lakehouse/database) 中的表。
+
 ### 查看导入作业
 
 可以通过 show load 命令查看已完成的 INSERT INTO 任务。
@@ -382,6 +386,7 @@ INSERT INTO target_tbl SELECT k1,k2,k3 FROM hive.db1.source_tbl limit 100;
 
 INSERT 命令是同步命令,返回成功,即表示导入成功。
 
+
 ### 注意事项
 
 - 必须保证外部数据源与 Doris 集群是可以互通,包括 BE 节点和外部数据源的网络是互通的。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
index e0337368e3..7e056884b3 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md
@@ -26,24 +26,33 @@ under the License.
 
 Apache Doris 提供了多种导入和集成数据的方法,您可以使用合适的导入方式从各种源将数据导入到数据库中。Apache Doris 提供的数据导入方式可以分为四类:
 
-1. **实时写入**:应用程序通过 HTTP 或者 JDBC 实时写入数据到 Doris 表中,适用于需要实时分析和查询的场景。
-   * 极少量数据(5 分钟一次)时可以使用 [JDBC INSERT](./import-way/insert-into-manual.md) 写入数据。
-   * 并发较高或者频次较高(大于 20 并发或者 1 分钟写入多次)时建议打开 [Group Commit](./import-way/group-commit-manual.md),使用 JDBC INSERT 或者 Stream Load 写入数据。
-   * 吞吐较高时推荐使用 [Stream Load](./import-way/stream-load-manua) 通过 HTTP 写入数据。
+- **实时写入**:应用程序通过 HTTP 或者 JDBC 实时写入数据到 Doris 表中,适用于需要实时分析和查询的场景。
+
-2. **流式同步**:通过实时数据流(如 Flink、Kafka、事务数据库)将数据实时导入到 Doris 表中,适用于需要实时分析和查询的场景。
-   * 可以使用 [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) 将 Flink 的实时数据流写入到 Doris 表中。
-   * 可以使用 [Routine Load](./import-way/routine-load-manual.md) 或者 [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) 将 Kafka 的实时数据流写入到 Doris 表中。Routine Load 方式下,Doris 会调度任务将 Kafka 中的数据拉取并写入 Doris 中,目前支持 csv 和 json 格式的数据。Kafka Connector 方式下,由 Kafka 将数据写入到 Doris 中,支持 avro、json、csv、protobuf 格式的数据。
-   * 可以使用 [Flink CDC](../../ecosystem/flink-doris-connector.md) 或 [Datax](../../ecosystem/datax.md) 将事务数据库的 CDC 数据流写入到 Doris 中。
+  - 极少量数据(5 分钟一次)时可以使用 [JDBC INSERT](./import-way/insert-into-manual.md) 写入数据。
+
-3. **批量导入**:将数据从外部存储系统(如 S3、HDFS、本地文件、NAS)批量加载到 Doris 表中,适用于非实时数据导入的需求。
-   * 可以使用 [Broker Load](./import-way/broker-load-manual.md) 将 S3 和 HDFS 中的文件写入到 Doris 中。
-   * 可以使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将 S3、HDFS 和 NAS 中的文件同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
-   * 可以使用 [Stream Load](./import-way/stream-load-manua) 或者 [Doris Streamloader](../../ecosystem/doris-streamloader.md) 将本地文件写入 Doris 中。
+  - 并发较高或者频次较高(大于 20 并发或者 1 分钟写入多次)时建议打开 [Group Commit](./import-way/group-commit-manual.md),使用 JDBC INSERT 或者 Stream Load 写入数据。
+
-4. **外部数据源集成**:通过与外部数据源(如 Hive、JDBC、Iceberg 等)的集成,实现对外部数据的查询和部分数据导入到 Doris 表中。
-   * 可以创建 [Catalog](../../lakehouse/lakehouse-overview.md) 读取外部数据源中的数据,使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将外部数据源中的数据同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
-   * 可以使用 [X2Doris](./migrate-data-from-other-olap.md) 将其他 AP 系统的数据迁移到 Doris 中。
+  - 吞吐较高时推荐使用 [Stream Load](./import-way/stream-load-manual) 通过 HTTP 写入数据。
+
+- **流式同步**:通过实时数据流(如 Flink、Kafka、事务数据库)将数据实时导入到 Doris 表中,适用于需要实时分析和查询的场景。
+
+  - 可以使用 [Flink Doris Connector](../../ecosystem/flink-doris-connector.md) 将 Flink 的实时数据流写入到 Doris 表中。
+
+  - 可以使用 [Routine Load](./import-way/routine-load-manual.md) 或者 [Doris Kafka Connector](../../ecosystem/doris-kafka-connector.md) 将 Kafka 的实时数据流写入到 Doris 表中。Routine Load 方式下,Doris 会调度任务将 Kafka 中的数据拉取并写入 Doris 中,目前支持 csv 和 json 格式的数据。Kafka Connector 方式下,由 Kafka 将数据写入到 Doris 中,支持 avro、json、csv、protobuf 格式的数据。
+
+  - 可以使用 [Flink CDC](../../ecosystem/flink-doris-connector.md) 或 [DataX](../../ecosystem/datax.md) 将事务数据库的 CDC 数据流写入到 Doris 中。
+
+- **批量导入**:将数据从外部存储系统(如 S3、HDFS、本地文件、NAS)批量加载到 Doris 表中,适用于非实时数据导入的需求。
+  - 可以使用 [Broker Load](./import-way/broker-load-manual.md) 将 S3 和 HDFS 中的文件写入到 Doris 中。
+
+  - 可以使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将 S3、HDFS 和 NAS 中的文件同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
+
+  - 可以使用 [Stream Load](./import-way/stream-load-manual) 或者 [Doris Streamloader](../../ecosystem/doris-streamloader.md) 将本地文件写入 Doris 中。
+
+- **外部数据源集成**:通过与外部数据源(如 Hive、JDBC、Iceberg 等)的集成,实现对外部数据的查询和部分数据导入到 Doris 表中。
+  - 可以创建 [Catalog](../../lakehouse/lakehouse-overview.md) 读取外部数据源中的数据,使用 [INSERT INTO SELECT](./import-way/insert-into-manual.md) 将外部数据源中的数据同步写入到 Doris 中,配合 [JOB](../scheduler/job-scheduler.md) 可以异步写入。
+
+  - 可以使用 [X2Doris](./migrate-data-from-other-olap.md) 将其他 AP 系统的数据迁移到 Doris 中。
 
 Doris 的每个导入默认都是一个隐式事务,事务相关的更多信息请参考[事务](../transaction.md)。
@@ -53,10 +62,10 @@ Doris 的导入主要涉及数据源、数据格式、导入方式、错误数
 
 | 导入方式 | 使用场景 | 支持的文件格式 | 单次导入数据量 | 导入模式 |
 | :-------------------------------------------- | :----------------------------------------- | ----------------------- | ----------------- | -------- |
-| [Stream Load](./import-way/stream-load-manual) | 从本地数据导入 | csv、json、parquet、orc | 小于10GB | 同步 |
+| [Stream Load](./import-way/stream-load-manual) | 导入本地文件或者应用程序写入 | csv、json、parquet、orc | 小于10GB | 同步 |
 | [Broker Load](./import-way/broker-load-manual.md) | 从对象存储、HDFS等导入 | csv、json、parquet、orc | 数十GB到数百 GB | 异步 |
-| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | <p>单条或小批量据量导入</p><p>通过JDBC等接口导入</p> | SQL | 简单测试用 | 同步 |
-| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | <p>Doris内部表之间数据导入</p><p>外部表导入</p> | SQL | 根据内存大小而定 | 同步 |
+| [INSERT INTO VALUES](./import-way/insert-into-manual.md) | 通过JDBC等接口导入 | SQL | 简单测试用 | 同步 |
+| [INSERT INTO SELECT](./import-way/insert-into-manual.md) | 可以导入外部表或者对象存储、HDFS中的文件 | SQL | 根据内存大小而定 | 同步 |
 | [Routine Load](./import-way/routine-load-manual.md) | 从kakfa实时导入 | csv、json | 微批导入 MB 到 GB | 异步 |
 | [MySQL Load](./import-way/mysql-load-manual.md) | 从本地数据导入 | csv | 小于10GB | 同步 |
 | [Group Commit](./import-way/group-commit-manual.md) | 高频小批量导入 | 根据使用的导入方式而定 | 微批导入KB | - |

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org