This is an automated email from the ASF dual-hosted git repository.

liaoxin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git

The following commit(s) were added to refs/heads/master by this push:
     new e8253fe62b4 [fix](load) fix the format of load best practices (#2335)
e8253fe62b4 is described below

commit e8253fe62b48e0a2ad5a4a9e4fbbda4cd981f947
Author: Xin Liao <liao...@selectdb.com>
AuthorDate: Mon Apr 28 11:22:11 2025 +0800

    [fix](load) fix the format of load best practices (#2335)
---
 docs/data-operate/import/load-best-practices.md | 16 ++++++++--------
 .../current/data-operate/import/load-best-practices.md | 16 ++++++++--------
 .../data-operate/import/load-best-practices.md | 16 ++++++++--------
 .../data-operate/import/load-best-practices.md | 16 ++++++++--------
 .../data-operate/import/load-best-practices.md | 16 ++++++++--------
 .../data-operate/import/load-best-practices.md | 16 ++++++++--------
 6 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/docs/data-operate/import/load-best-practices.md b/docs/data-operate/import/load-best-practices.md
index 8e563aed778..efe4736f20e 100644
--- a/docs/data-operate/import/load-best-practices.md
+++ b/docs/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# Table Model Selection
+## Table Model Selection
 It is recommended to prioritize using the Duplicate Key model, which offers advantages in both data loading and query performance compared to other models. For more information, please refer to: [Data Model](../../table-design/data-model/overview)
-# Partition and Bucket Configuration
+## Partition and Bucket Configuration
 It is recommended to keep the size of a tablet between 1-10GB. Tablets that are too small may lead to poor aggregation performance and increase metadata management overhead; tablets that are too large may hinder replica migration and repair. For details, please refer to: [Data Distribution](../../table-design/data-partitioning/data-distribution).
-# Random Bucketing
+## Random Bucketing
 When using Random bucketing, you can enable single-tablet loading mode by setting load_to_single_tablet to true. This mode can improve data loading concurrency and throughput while reducing write amplification during large-scale data loading. For details, refer to: [Random Bucketing](../../table-design/data-partitioning/data-bucketing#random-bucketing)
-# Batch Loading
+## Batch Loading
 Client-side batching: It is recommended to batch data (from several MB to GB in size) on the client side before loading. High-frequency small loads will cause frequent compaction, leading to severe write amplification issues.
 Server-side batching: For high-concurrency small data volume loading, it is recommended to enable [Group Commit](group-commit-manual.md) to implement batching on the server side.
-# Partition Loading
+## Partition Loading
 It is recommended to load data from only a few partitions at a time. Loading from too many partitions simultaneously will increase memory usage and may cause performance issues. Each tablet in Doris has an active Memtable in memory, which is flushed to disk when it reaches a certain size. To prevent process OOM, when the active Memtable's memory usage is too high, it will trigger early flushing, resulting in many small files and affecting loading performance.
-# Large-scale Data Batch Loading
+## Large-scale Data Batch Loading
 When dealing with a large number of files or large data volumes, it is recommended to load in batches to avoid high retry costs in case of loading failures and to reduce system resource impact. For Broker Load, it is recommended not to exceed 100GB per batch. For large local data files, you can use Doris's streamloader tool, which automatically performs batch loading.
-# Broker Load Concurrency
+## Broker Load Concurrency
 Compressed files/Parquet/ORC files: It is recommended to split files into multiple smaller files for loading to achieve higher concurrency.
@@ -57,6 +57,6 @@ Uncompressed CSV and JSON files: Doris will automatically split files and load t
 For concurrency strategies, please refer to: [Broker Load Configuration Parameters](./import-way/broker-load-manual#Related-Configurations)
-# Stream Load Concurrency
+## Stream Load Concurrency
 It is recommended to keep Stream load concurrency per BE under 128 (controlled by BE's webserver_num_workers parameter). High concurrency may cause webserver thread exhaustion and affect loading performance. Particularly when a single BE's concurrency exceeds 512 (doris_max_remote_scanner_thread_pool_thread_num parameter), it may cause the BE process to hang.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-best-practices.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-best-practices.md
index de35ad83b29..dc310523c00 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-best-practices.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# 表模型选择
+## 表模型选择
 建议优先考虑使用明细模型, 明细模型在数据导入和查询性能方面相比其他模型都具有优势。如需了解更多信息,请参考:[数据模型](../../table-design/data-model/overview)
-# 分区分桶配置
+## 分区分桶配置
 建议一个 tablet 的大小在 1-10G 范围内。过小的 tablet 可能导致聚合效果不佳,增加元数据管理压力;过大的 tablet 不利于副本迁移、补齐。详细请参考:[数据分布](../../table-design/data-partitioning/data-distribution)。
-# Random 分桶
+## Random 分桶
 在使用Random分桶时,可以通过设置load_to_single_tablet为true来启用单分片导入模式。这种模式在大规模数据导入过程中,能够提升数据导入的并发度和吞吐量,减少写放大问题。详细参考:[Random分桶](../../table-design/data-partitioning/data-bucketing#random-分桶)
-# 攒批导入
+## 攒批导入
 客户端攒批:建议将数据在客户端进行攒批(数MB到数GB大小)后再进行导入,高频小导入会频繁做compaction,导致严重的写放大问题。
 服务端攒批:对于高并发小数据量导入,建议打开[Group Commit](group-commit-manual.md),在服务端实现攒批导入。
-# 分区导入
+## 分区导入
 每次导入建议只导入少量分区的数据。过多的分区同时导入会增加内存占用,并可能导致性能问题。Doris每个tablet在内存中有一个活跃的Memtable,每个Memtable达到一定大小时才会下刷到磁盘。为了避免进程OOM,当活跃的Memtable占用内存过高时,会提前触发Memtable下刷,导致产生大量小文件,同时会影响导入的性能。
-# 大规模数据分批导入
+## 大规模数据分批导入
 需要导入的文件数较多、数据量很大时,建议分批进行导入,避免导入出错后重试代价太大,同时减少对系统资源的冲击。对 Broker Load 每批次导入的数据量建议不超过100G。对于本地的大数据量文件,可以使用Doris提供的streamloader工具进行导入,该工具会自动进行分批导入。
-# Broker Load 导入并发数
+## Broker Load 导入并发数
 压缩文件/Parquet/ORC文件:建议将文件分割成多个小文件进行导入,以实现多并发导入。
@@ -57,6 +57,6 @@ under the License.
 并发数策略请参考:[Broker Load导入配置参数](./import-way/broker-load-manual#导入配置参数)
-# Stream load并发导入
+## Stream load并发导入
 Stream load单BE上的并发数建议不超过128(由BE的webserver_num_workers参数控制)。过高的并发数可能导致webserver线程数不够用,影响导入性能。特别是当单个BE的并发数超过512(doris_max_remote_scanner_thread_pool_thread_num参数)时,可能会导致BE进程卡住。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-best-practices.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-best-practices.md
index de35ad83b29..dc310523c00 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-best-practices.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# 表模型选择
+## 表模型选择
 建议优先考虑使用明细模型, 明细模型在数据导入和查询性能方面相比其他模型都具有优势。如需了解更多信息,请参考:[数据模型](../../table-design/data-model/overview)
-# 分区分桶配置
+## 分区分桶配置
 建议一个 tablet 的大小在 1-10G 范围内。过小的 tablet 可能导致聚合效果不佳,增加元数据管理压力;过大的 tablet 不利于副本迁移、补齐。详细请参考:[数据分布](../../table-design/data-partitioning/data-distribution)。
-# Random 分桶
+## Random 分桶
 在使用Random分桶时,可以通过设置load_to_single_tablet为true来启用单分片导入模式。这种模式在大规模数据导入过程中,能够提升数据导入的并发度和吞吐量,减少写放大问题。详细参考:[Random分桶](../../table-design/data-partitioning/data-bucketing#random-分桶)
-# 攒批导入
+## 攒批导入
 客户端攒批:建议将数据在客户端进行攒批(数MB到数GB大小)后再进行导入,高频小导入会频繁做compaction,导致严重的写放大问题。
 服务端攒批:对于高并发小数据量导入,建议打开[Group Commit](group-commit-manual.md),在服务端实现攒批导入。
-# 分区导入
+## 分区导入
 每次导入建议只导入少量分区的数据。过多的分区同时导入会增加内存占用,并可能导致性能问题。Doris每个tablet在内存中有一个活跃的Memtable,每个Memtable达到一定大小时才会下刷到磁盘。为了避免进程OOM,当活跃的Memtable占用内存过高时,会提前触发Memtable下刷,导致产生大量小文件,同时会影响导入的性能。
-# 大规模数据分批导入
+## 大规模数据分批导入
 需要导入的文件数较多、数据量很大时,建议分批进行导入,避免导入出错后重试代价太大,同时减少对系统资源的冲击。对 Broker Load 每批次导入的数据量建议不超过100G。对于本地的大数据量文件,可以使用Doris提供的streamloader工具进行导入,该工具会自动进行分批导入。
-# Broker Load 导入并发数
+## Broker Load 导入并发数
 压缩文件/Parquet/ORC文件:建议将文件分割成多个小文件进行导入,以实现多并发导入。
@@ -57,6 +57,6 @@ under the License.
 并发数策略请参考:[Broker Load导入配置参数](./import-way/broker-load-manual#导入配置参数)
-# Stream load并发导入
+## Stream load并发导入
 Stream load单BE上的并发数建议不超过128(由BE的webserver_num_workers参数控制)。过高的并发数可能导致webserver线程数不够用,影响导入性能。特别是当单个BE的并发数超过512(doris_max_remote_scanner_thread_pool_thread_num参数)时,可能会导致BE进程卡住。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-best-practices.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-best-practices.md
index de35ad83b29..dc310523c00 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-best-practices.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# 表模型选择
+## 表模型选择
 建议优先考虑使用明细模型, 明细模型在数据导入和查询性能方面相比其他模型都具有优势。如需了解更多信息,请参考:[数据模型](../../table-design/data-model/overview)
-# 分区分桶配置
+## 分区分桶配置
 建议一个 tablet 的大小在 1-10G 范围内。过小的 tablet 可能导致聚合效果不佳,增加元数据管理压力;过大的 tablet 不利于副本迁移、补齐。详细请参考:[数据分布](../../table-design/data-partitioning/data-distribution)。
-# Random 分桶
+## Random 分桶
 在使用Random分桶时,可以通过设置load_to_single_tablet为true来启用单分片导入模式。这种模式在大规模数据导入过程中,能够提升数据导入的并发度和吞吐量,减少写放大问题。详细参考:[Random分桶](../../table-design/data-partitioning/data-bucketing#random-分桶)
-# 攒批导入
+## 攒批导入
 客户端攒批:建议将数据在客户端进行攒批(数MB到数GB大小)后再进行导入,高频小导入会频繁做compaction,导致严重的写放大问题。
 服务端攒批:对于高并发小数据量导入,建议打开[Group Commit](group-commit-manual.md),在服务端实现攒批导入。
-# 分区导入
+## 分区导入
 每次导入建议只导入少量分区的数据。过多的分区同时导入会增加内存占用,并可能导致性能问题。Doris每个tablet在内存中有一个活跃的Memtable,每个Memtable达到一定大小时才会下刷到磁盘。为了避免进程OOM,当活跃的Memtable占用内存过高时,会提前触发Memtable下刷,导致产生大量小文件,同时会影响导入的性能。
-# 大规模数据分批导入
+## 大规模数据分批导入
 需要导入的文件数较多、数据量很大时,建议分批进行导入,避免导入出错后重试代价太大,同时减少对系统资源的冲击。对 Broker Load 每批次导入的数据量建议不超过100G。对于本地的大数据量文件,可以使用Doris提供的streamloader工具进行导入,该工具会自动进行分批导入。
-# Broker Load 导入并发数
+## Broker Load 导入并发数
 压缩文件/Parquet/ORC文件:建议将文件分割成多个小文件进行导入,以实现多并发导入。
@@ -57,6 +57,6 @@ under the License.
 并发数策略请参考:[Broker Load导入配置参数](./import-way/broker-load-manual#导入配置参数)
-# Stream load并发导入
+## Stream load并发导入
 Stream load单BE上的并发数建议不超过128(由BE的webserver_num_workers参数控制)。过高的并发数可能导致webserver线程数不够用,影响导入性能。特别是当单个BE的并发数超过512(doris_max_remote_scanner_thread_pool_thread_num参数)时,可能会导致BE进程卡住。
diff --git a/versioned_docs/version-2.1/data-operate/import/load-best-practices.md b/versioned_docs/version-2.1/data-operate/import/load-best-practices.md
index 8e563aed778..efe4736f20e 100644
--- a/versioned_docs/version-2.1/data-operate/import/load-best-practices.md
+++ b/versioned_docs/version-2.1/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# Table Model Selection
+## Table Model Selection
 It is recommended to prioritize using the Duplicate Key model, which offers advantages in both data loading and query performance compared to other models. For more information, please refer to: [Data Model](../../table-design/data-model/overview)
-# Partition and Bucket Configuration
+## Partition and Bucket Configuration
 It is recommended to keep the size of a tablet between 1-10GB. Tablets that are too small may lead to poor aggregation performance and increase metadata management overhead; tablets that are too large may hinder replica migration and repair. For details, please refer to: [Data Distribution](../../table-design/data-partitioning/data-distribution).
-# Random Bucketing
+## Random Bucketing
 When using Random bucketing, you can enable single-tablet loading mode by setting load_to_single_tablet to true. This mode can improve data loading concurrency and throughput while reducing write amplification during large-scale data loading. For details, refer to: [Random Bucketing](../../table-design/data-partitioning/data-bucketing#random-bucketing)
-# Batch Loading
+## Batch Loading
 Client-side batching: It is recommended to batch data (from several MB to GB in size) on the client side before loading. High-frequency small loads will cause frequent compaction, leading to severe write amplification issues.
 Server-side batching: For high-concurrency small data volume loading, it is recommended to enable [Group Commit](group-commit-manual.md) to implement batching on the server side.
-# Partition Loading
+## Partition Loading
 It is recommended to load data from only a few partitions at a time. Loading from too many partitions simultaneously will increase memory usage and may cause performance issues. Each tablet in Doris has an active Memtable in memory, which is flushed to disk when it reaches a certain size. To prevent process OOM, when the active Memtable's memory usage is too high, it will trigger early flushing, resulting in many small files and affecting loading performance.
-# Large-scale Data Batch Loading
+## Large-scale Data Batch Loading
 When dealing with a large number of files or large data volumes, it is recommended to load in batches to avoid high retry costs in case of loading failures and to reduce system resource impact. For Broker Load, it is recommended not to exceed 100GB per batch. For large local data files, you can use Doris's streamloader tool, which automatically performs batch loading.
-# Broker Load Concurrency
+## Broker Load Concurrency
 Compressed files/Parquet/ORC files: It is recommended to split files into multiple smaller files for loading to achieve higher concurrency.
@@ -57,6 +57,6 @@ Uncompressed CSV and JSON files: Doris will automatically split files and load t
 For concurrency strategies, please refer to: [Broker Load Configuration Parameters](./import-way/broker-load-manual#Related-Configurations)
-# Stream Load Concurrency
+## Stream Load Concurrency
 It is recommended to keep Stream load concurrency per BE under 128 (controlled by BE's webserver_num_workers parameter). High concurrency may cause webserver thread exhaustion and affect loading performance. Particularly when a single BE's concurrency exceeds 512 (doris_max_remote_scanner_thread_pool_thread_num parameter), it may cause the BE process to hang.
diff --git a/versioned_docs/version-3.0/data-operate/import/load-best-practices.md b/versioned_docs/version-3.0/data-operate/import/load-best-practices.md
index 8e563aed778..efe4736f20e 100644
--- a/versioned_docs/version-3.0/data-operate/import/load-best-practices.md
+++ b/versioned_docs/version-3.0/data-operate/import/load-best-practices.md
@@ -24,32 +24,32 @@ specific language governing permissions and limitations
 under the License.
 -->
-# Table Model Selection
+## Table Model Selection
 It is recommended to prioritize using the Duplicate Key model, which offers advantages in both data loading and query performance compared to other models. For more information, please refer to: [Data Model](../../table-design/data-model/overview)
-# Partition and Bucket Configuration
+## Partition and Bucket Configuration
 It is recommended to keep the size of a tablet between 1-10GB. Tablets that are too small may lead to poor aggregation performance and increase metadata management overhead; tablets that are too large may hinder replica migration and repair. For details, please refer to: [Data Distribution](../../table-design/data-partitioning/data-distribution).
-# Random Bucketing
+## Random Bucketing
 When using Random bucketing, you can enable single-tablet loading mode by setting load_to_single_tablet to true. This mode can improve data loading concurrency and throughput while reducing write amplification during large-scale data loading. For details, refer to: [Random Bucketing](../../table-design/data-partitioning/data-bucketing#random-bucketing)
-# Batch Loading
+## Batch Loading
 Client-side batching: It is recommended to batch data (from several MB to GB in size) on the client side before loading. High-frequency small loads will cause frequent compaction, leading to severe write amplification issues.
 Server-side batching: For high-concurrency small data volume loading, it is recommended to enable [Group Commit](group-commit-manual.md) to implement batching on the server side.
-# Partition Loading
+## Partition Loading
 It is recommended to load data from only a few partitions at a time. Loading from too many partitions simultaneously will increase memory usage and may cause performance issues. Each tablet in Doris has an active Memtable in memory, which is flushed to disk when it reaches a certain size. To prevent process OOM, when the active Memtable's memory usage is too high, it will trigger early flushing, resulting in many small files and affecting loading performance.
-# Large-scale Data Batch Loading
+## Large-scale Data Batch Loading
 When dealing with a large number of files or large data volumes, it is recommended to load in batches to avoid high retry costs in case of loading failures and to reduce system resource impact. For Broker Load, it is recommended not to exceed 100GB per batch. For large local data files, you can use Doris's streamloader tool, which automatically performs batch loading.
-# Broker Load Concurrency
+## Broker Load Concurrency
 Compressed files/Parquet/ORC files: It is recommended to split files into multiple smaller files for loading to achieve higher concurrency.
@@ -57,6 +57,6 @@ Uncompressed CSV and JSON files: Doris will automatically split files and load t
 For concurrency strategies, please refer to: [Broker Load Configuration Parameters](./import-way/broker-load-manual#Related-Configurations)
-# Stream Load Concurrency
+## Stream Load Concurrency
 It is recommended to keep Stream load concurrency per BE under 128 (controlled by BE's webserver_num_workers parameter). High concurrency may cause webserver thread exhaustion and affect loading performance. Particularly when a single BE's concurrency exceeds 512 (doris_max_remote_scanner_thread_pool_thread_num parameter), it may cause the BE process to hang.
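
Note on the "Batch Loading" section touched by this patch: the client-side batching advice (accumulate several MB to GB before each load) could look roughly like the sketch below. It is only an illustration and is not part of the commit. The names `batch_rows`, `max_batch_bytes`, and `flush` are hypothetical; `flush` stands in for whatever actually sends one batch to Doris (for example one Stream Load request) and is not a Doris API.

```python
from typing import Callable, Iterable, List


def batch_rows(rows: Iterable[bytes], max_batch_bytes: int,
               flush: Callable[[List[bytes]], None]) -> int:
    """Group rows into batches of at most max_batch_bytes and flush each one.

    Returns the number of batches flushed. A single row larger than the
    limit is still sent, alone, in its own batch.
    """
    batch: List[bytes] = []
    batch_bytes = 0
    flushed = 0
    for row in rows:
        # Flush first if adding this row would push the batch over the limit.
        if batch and batch_bytes + len(row) > max_batch_bytes:
            flush(batch)
            flushed += 1
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += len(row)
    if batch:
        # Send the final, possibly partial, batch.
        flush(batch)
        flushed += 1
    return flushed


if __name__ == "__main__":
    sent: List[List[bytes]] = []
    # Toy limit of 10 bytes; a real client would use MBs to GBs.
    count = batch_rows([b"aaaa"] * 5, 10, sent.append)
    print(count, [len(b) for b in sent])
```

With a real loader, `flush` would issue one load per batch, so the number of loads (and the compaction they trigger) drops as `max_batch_bytes` grows. For many concurrent small writers, the patched document points to Group Commit for server-side batching instead.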