This is an automated email from the ASF dual-hosted git repository. zhangchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 13936a420d6 [doc](update) rewrite update overview doc (#1393) 13936a420d6 is described below commit 13936a420d6443551c23317d0151d7724306dc9d Author: zhannngchen <zhangc...@selectdb.com> AuthorDate: Mon Dec 9 20:42:31 2024 +0800 [doc](update) rewrite update overview doc (#1393) # Versions - [x] dev - [x] 3.0 - [x] 2.1 - [x] 2.0 # Languages - [x] Chinese - [x] English --- docs/data-operate/update/update-overview.md | 81 ++++++++++--------- .../current/data-operate/update/update-overview.md | 86 ++++++++++++--------- .../data-operate/update/update-overview.md | 88 ++++++++++++--------- .../data-operate/update/update-overview.md | 90 ++++++++++++---------- .../data-operate/update/update-overview.md | 88 ++++++++++++--------- .../data-operate/update/update-overview.md | 80 +++++++++++-------- .../data-operate/update/update-overview.md | 80 +++++++++++-------- .../data-operate/update/update-overview.md | 82 +++++++++++--------- 8 files changed, 385 insertions(+), 290 deletions(-) diff --git a/docs/data-operate/update/update-overview.md b/docs/data-operate/update/update-overview.md index aff46a96c55..1515852091d 100644 --- a/docs/data-operate/update/update-overview.md +++ b/docs/data-operate/update/update-overview.md @@ -23,8 +23,33 @@ KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> +Data update refers to modifying the value columns in data records with the same key. The handling of data updates varies for different data models: -Data update primarily refers to the modification of the value column of data with the same key. For the primary(unique) key model, this update involves replacing the existing value, while for the aggregate model, it involves aggregating the values in the value column. +- **Primary Key (Unique) Model**: The primary key model is specifically designed for data updates. Doris supports two storage implementations: Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance, while MoW provides better analytical performance. From Doris version 2.1, the default storage method is MoW. The primary key model supports using the `UPDATE` statement for small data updates and also supports batch updates through data loading. Loading methods include [...] + +- **Aggregate Model**: In the aggregate model, data update is a special use case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data updates can be achieved. The aggregate model only supports updates based on data loading and does not support using the `UPDATE` statement. By setting the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability can be achieved. + +By understanding the data update methods of different models, you can better choose the appropriate update strategy to meet specific business needs. + +## Comparison of Update Capabilities for Different Models/Implementations + +### Performance Comparison +| | Unique Key MoW | Unique Key MoR [...] +|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------ [...] +| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key [...] +| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of M [...] +| Predicate Pushdown | Supported | Not Supported [...] +| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] + +### Feature Support Comparison +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|------------------------------------| +| UPDATE | Supported | Supported | Not Supported | +| DELETE | Supported | Supported | Not Supported | +| sequence column| Supported | Supported | Not Supported | +| delete_sign | Supported | Supported | Not Supported | +| Partial Column Updates | Supported | Not Supported | Supported (can't update null value) | +| Inverted Index | Supported | Not Supported | Not Supported | ## Update in Primary Key (Unique) Model @@ -35,14 +60,14 @@ By default, in Doris 2.0, the unique key model is still based on MoR. To create ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -53,7 +78,7 @@ PROPERTIES ( ``` :::caution -Starting from Doris 2.1, MoW is the default mode for the unique key model. So, if you are using Doris 2.1 or higher version, make sure to read the relevant table creation documentation. +Starting from Doris 2.1, MoW is the default mode for the unique key model. So, if you are using Doris 2.1 or a higher version, make sure to read the relevant table creation documentation. ::: ### Two Update Methods in Unique Key Model @@ -72,13 +97,13 @@ Doris supports multiple data load methods, including Stream Load, Broker Load, R We will provide detailed explanations of these two update methods in the documentation: [Update in Unique Key Model](../update/unique-update) and [Load Update in Unique Key Model](../update/update-of-unique-model). -### Update Transactions in Unique Key Model +### Concurrency Control for Primary Key Model Updates -#### Updating Data Using the `UPDATE` Statement +#### Using `UPDATE` Statements to Update Data -By default, Doris does not allow multiple `UPDATE` operations on the same table to occur concurrently. The `UPDATE` statement uses table-level locking to ensure transactional consistency. +By default, Doris does not allow multiple `UPDATE` operations on the same table at the same time. The `UPDATE` statement ensures isolation through table-level locking. -Users can adjust concurrency limits by modifying the FE configuration `enable_concurrent_update=true`. When concurrency limits are relaxed, the `UPDATE` statement will no longer provide transactional guarantees. +Users can adjust the concurrency limit by modifying the FE configuration `enable_concurrent_update=true`. When the concurrency limit is relaxed, if multiple `UPDATE` statements update the same row of data, the result will be undefined. #### Batch Updates Based on Load @@ -94,26 +119,12 @@ For more detailed information on transaction mechanisms, refer to the documentat The update in the aggregate model refers to the process of generating new aggregate values by combining new column values with existing aggregate values, according to the requirements of the aggregate functions. -New Agg Value = Agg Func ( Old Agg Value, New Column Value) - -The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). +New Agg Value = Agg Func (Old Agg Value, New Column Value) -## Comparison of Update Capabilities for Different Models/Implementations +The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). -### Performance Comparison -| | Unique Key MoW | Unique Key MoR [...] -|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------ [...] -| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key [...] -| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of M [...] -| Predicate Pushdown | Supported | Not Supported [...] -| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] - -### Feature Support Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|------------------------------------| -| UPDATE | Supported | Supported | Not Supported | -| DELETE | Supported | Supported | Not Supported | -| sequence column| Supported | Supported | Not Supported | -| delete_sign | Supported | Supported | Not Supported | -| Partial Column Updates | Supported | Not Supported | Supported(can't update null value) | -| Inverted Index | Supported | Not Supported | Not Supported | +## Recommendations for Choosing Between Primary Key and Aggregate Models +- For most scenarios that require data updates, it is recommended to **prefer the primary key model**. Examples include synchronizing from TP databases to Doris via CDC, user profiling, and audience targeting. +- The following scenarios are recommended to use the aggregate model: + 1. Some fields need to be aggregated as metrics, while others need to be updated. + 2. Scenarios where partial column updates are needed, while being very sensitive to write performance and having low requirements for query latency, it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL aggregate function. \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md index f7cc267b6de..fb4adc265ba 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-overview.md @@ -24,7 +24,33 @@ specific language governing permissions and limitations under the License. --> -数据更新,主要指针对相同 Key 的数据 Value 列的值的更新,这个更新对于主键模型来说,就是替换,对于聚合模型来说,就是如何完成针对 value 列上的聚合。 +数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同: + +- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和 Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为 MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker Load、Routine Load 和 Insert Into 等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。 + +- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或 REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为 REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。 + +通过对不同模型的数据更新方式的理解,可以更好地选择适合的更新策略,以满足具体的业务需求。 + +## 不同模型/实现的更新能力对比 + +### 性能对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比 MoR 约有 30%-50% 的性能损失 | 与 Duplicate Key 接近 | 与 Duplicate Key 接近 | +| 查询速度 | 与 Duplicate Key 接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍 | 如果聚合函数为 REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 | +| 谓词下推 | 支持 | 不支持 | 不支持 | +| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的 CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 | + +### 功能支持对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|----------------| +| UPDATE | 支持 | 支持 | 不支持 | +| DELETE | 支持 | 支持 | 不支持 | +| sequence 列 | 支持 | 支持 | 不支持 | +| delete_sign | 支持 | 支持 | 不支持 | +| 部分列更新 | 支持 | 不支持 | 支持(但无法更新 null 值) | +| 倒排索引 | 支持 | 不支持 | 不支持 | ## 主键(Unique)模型的更新 @@ -35,14 +61,14 @@ Doris 主键 (unique) 模型,从 Doris 2.0 开始,除了原来的 Merge-on-R ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -58,9 +84,9 @@ PROPERTIES ( ### 主键模型的两种更新方式 -#### 使用`UPDATE`语句更新 +#### 使用 `UPDATE` 语句更新 -无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次UPDATE的耗时会随着被更新的数据量的增加而增长。 +无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次 UPDATE 的耗时会随着被更新的数据量的增加而增长。 #### 基于导入的批量更新 @@ -72,13 +98,13 @@ Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routin 我们会分别在文档 [主键模型的 Update 更新](../update/unique-update) 和 [主键模型的导入更新](../update/update-of-unique-model) 详细介绍两种更新方式。 -### 主键模型的更新事务 +### 主键模型的更新并发控制 -#### 使用`UPDATE`语句更新数据 +#### 使用 `UPDATE` 语句更新数据 -默认情况下,Doris 不允许在同一时间对同一张表进行多个`UPDATE`操作。`UPDATE`语句通过表级锁来确保事务的一致性。 +默认情况下,Doris 不允许在同一时间对同一张表进行多个 `UPDATE` 操作。`UPDATE` 语句通过表级锁来确保隔离性。 -用户可以通过修改 FE 配置`enable_concurrent_update=true`来调整并发限制。当放宽并发限制时,`UPDATE`语句将不再提供事务保证。 +用户可以通过修改 FE 配置 `enable_concurrent_update=true` 来调整并发限制。当放宽并发限制时,多个 `UPDATE` 语句如果更新同一行数据,结果将是未定义的。 #### 基于导入的批量更新 @@ -88,32 +114,18 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据 由于多个并发导入更新的提交顺序可能无法预期,若这些并发导入涉及相同主键的更新,则其生效顺序也无法预知,最终的可见结果会因此存在不确定性。为解决此问题,Doris 提供了 sequence 列机制,允许用户在并发导入更新时为每一行数据指定版本,以便明确控制并发更新的结果顺序,实现确定性。 -我们将在文档[主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍 +我们将在文档 [主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍。 ## 聚合(Aggregate)模型的更新 -聚合模型的更新,主要是指的是用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 +聚合模型的更新,主要是指用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 New Agg Value = Agg Func ( Old Agg Value, New Column Value) -聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 - -## 不同模型/实现的更新能力对比 - -### 性能对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| -| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比MoR约有10%-20%的性能损失,大批量导入(例如千万级/亿级数据)相比MoR约有30%-50%的性能损失 | 与Duplicate Key接近 | 与Duplicate Key接近 | -| 查询速度 | 与Duplicate Key接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的3-10倍 | 如果聚合函数为REPLACE/REPLACE_IF_NOT_NULL,查询速度与MoR接近 | -| 谓词下推 | 支持 | 不支持 | 不支持 | -| 资源消耗 | - **导入资源消耗**:相比Duplicate Key/Unique Key MoR,约额外消耗约10%-30%的CPU。<br /> - **查询资源消耗**:与Duplicate Key接近,无额外资源消耗。<br /> - **Compaction资源消耗**:相比Duplicate Key,消耗更多内存和CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与Duplicate Key相近,无额外资源消耗。<br /> - **查询资源消耗**:相比Duplicate Key/Unique Key MoW,查询时额外消耗更多的CPU和内存。<br /> - **Compaction资源消耗**:相比Duplicate Key,需更多内存和CPU,具体数值取决于数据特征和数据量。 | 与Unique Key MoR相同 | +聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 -### 功能支持对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|----------------| -| UPDATE |支持|支持| 不支持 | -| DELETE |支持|支持| 不支持 | -| sequence列 |支持|支持| 不支持 | -| delete_sign |支持|支持| 不支持 | -| 部分列更新 |支持|不支持| 支持(但无法更新null值) | -| 倒排索引 |支持|不支持| 不支持 | +## 主键模型和聚合模型的选择建议 +- 大部分有数据更新需求的场景,都建议**首选主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。 +- 下面两类场景,建议使用聚合模型: + 1. 部分字段需要做指标聚合,部分字段需要进行更新。 + 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/update/update-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/update/update-overview.md index 7a21d4ffcef..c097f4c34aa 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/update/update-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/data-operate/update/update-overview.md @@ -24,7 +24,33 @@ specific language governing permissions and limitations under the License. --> -数据更新,主要指针对相同 Key 的数据 Value 列的值的更新,这个更新对于主键模型来说,就是替换,对于聚合模型来说,就是如何完成针对 value 列上的聚合。 +数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同: + +- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和 Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。在 Doris 2.0 版本中,默认存储方式为 MoR。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker Load、Routine Load 和 Insert Into 等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。 + +- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或 REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为 REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。 + +通过对不同模型的数据更新方式的理解,可以更好地选择适合的更新策略,以满足具体的业务需求。 + +## 不同模型/实现的更新能力对比 + +### 性能对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比 MoR 约有 30%-50% 的性能损失 | 与 Duplicate Key 接近 | 与 Duplicate Key 接近 | +| 查询速度 | 与 Duplicate Key 接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍 | 如果聚合函数为 REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 | +| 谓词下推 | 支持 | 不支持 | 不支持 | +| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的 CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 | + +### 功能支持对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|----------------| +| UPDATE | 支持 | 支持 | 不支持 | +| DELETE | 支持 | 支持 | 不支持 | +| sequence 列 | 支持 | 支持 | 不支持 | +| delete_sign | 支持 | 支持 | 不支持 | +| 部分列更新 | 支持 | 不支持 | 支持(但无法更新 null 值) | +| 倒排索引 | 支持 | 不支持 | 不支持 | ## 主键(Unique)模型的更新 @@ -35,14 +61,14 @@ Doris 主键 (unique) 模型,从 Doris 2.0 开始,除了原来的 Merge-on-R ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -58,9 +84,9 @@ PROPERTIES ( ### 主键模型的两种更新方式 -#### 使用`UPDATE`语句更新 +#### 使用 `UPDATE` 语句更新 -无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次UPDATE的耗时会随着被更新的数据量的增加而增长。 +无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次 UPDATE 的耗时会随着被更新的数据量的增加而增长。 #### 基于导入的批量更新 @@ -72,13 +98,13 @@ Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routin 我们会分别在文档 [主键模型的 Update 更新](../update/unique-update) 和 [主键模型的导入更新](../update/update-of-unique-model) 详细介绍两种更新方式。 -### 主键模型的更新事务 +### 主键模型的更新并发控制 -#### 使用`UPDATE`语句更新数据 +#### 使用 `UPDATE` 语句更新数据 -默认情况下,Doris 不允许在同一时间对同一张表进行多个`UPDATE`操作。`UPDATE`语句通过表级锁来确保事务的一致性。 +默认情况下,Doris 不允许在同一时间对同一张表进行多个 `UPDATE` 操作。`UPDATE` 语句通过表级锁来确保隔离性。 -用户可以通过修改 FE 配置`enable_concurrent_update=true`来调整并发限制。当放宽并发限制时,`UPDATE`语句将不再提供事务保证。 +用户可以通过修改 FE 配置 `enable_concurrent_update=true` 来调整并发限制。当放宽并发限制时,多个 `UPDATE` 语句如果更新同一行数据,结果将是未定义的。 #### 基于导入的批量更新 @@ -88,32 +114,18 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据 由于多个并发导入更新的提交顺序可能无法预期,若这些并发导入涉及相同主键的更新,则其生效顺序也无法预知,最终的可见结果会因此存在不确定性。为解决此问题,Doris 提供了 sequence 列机制,允许用户在并发导入更新时为每一行数据指定版本,以便明确控制并发更新的结果顺序,实现确定性。 -我们将在文档[主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍 +我们将在文档 [主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍。 ## 聚合(Aggregate)模型的更新 -聚合模型的更新,主要是指的是用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 +聚合模型的更新,主要是指用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 -New Agg Value = Agg Func ( Old Agg Value, New Column Value) +New Agg Value = Agg Func (Old Agg Value, New Column Value) -聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 - -## 不同模型/实现的更新能力对比 +聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 -### 性能对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比MoR约有10%-20%的性能损失,大批量导入(例如千万级/亿级数据)相比MoR约有30%-50%的性能损失 | 与Duplicate Key接近 | 与Duplicate Key接近 | -| 查询速度 | 与Duplicate Key接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的3-10倍 | 如果聚合函数为REPLACE/REPLACE_IF_NOT_NULL,查询速度与MoR接近 | -| 谓词下推 | 支持 | 不支持 | 不支持 | -| 资源消耗 | - **导入资源消耗**:相比Duplicate Key/Unique Key MoR,约额外消耗约10%-30%的CPU。<br /> - **查询资源消耗**:与Duplicate Key接近,无额外资源消耗。<br /> - **Compaction资源消耗**:相比Duplicate Key,消耗更多内存和CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与Duplicate Key相近,无额外资源消耗。<br /> - **查询资源消耗**:相比Duplicate Key/Unique Key MoW,查询时额外消耗更多的CPU和内存。<br /> - **Compaction资源消耗**:相比Duplicate Key,需更多内存和CPU,具体数值取决于数据特征和数据量。 | 与Unique Key MoR相同 | - -### 功能支持对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE |支持|支持|不支持| -| DELETE |支持|支持|不支持| -| sequence列 |支持|支持|不支持| -| delete_sign |支持|支持|不支持| -| 部分列更新 |支持|不支持| 支持(但无法更新null值) | -| 倒排索引 |支持|不支持|不支持| +## 主键模型和聚合模型的选择建议 +- 大部分有数据更新需求的场景,都建议**首选主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。 +- 下面两类场景,建议使用聚合模型 + 1. 部分字段需要做指标聚合,部分字段需要进行更新 + 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md index 7a21d4ffcef..a57ec68fb74 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-overview.md @@ -24,25 +24,51 @@ specific language governing permissions and limitations under the License. --> -数据更新,主要指针对相同 Key 的数据 Value 列的值的更新,这个更新对于主键模型来说,就是替换,对于聚合模型来说,就是如何完成针对 value 列上的聚合。 +数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同: + +- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和 Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为 MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker Load、Routine Load 和 Insert Into 等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。 + +- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或 REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为 REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。 + +通过对不同模型的数据更新方式的理解,可以更好地选择适合的更新策略,以满足具体的业务需求。 + +## 不同模型/实现的更新能力对比 + +### 性能对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比 MoR 约有 30%-50% 的性能损失 | 与 Duplicate Key 接近 | 与 Duplicate Key 接近 | +| 查询速度 | 与 Duplicate Key 接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍 | 如果聚合函数为 REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 | +| 谓词下推 | 支持 | 不支持 | 不支持 | +| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的 CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 | + +### 功能支持对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|----------------| +| UPDATE | 支持 | 支持 | 不支持 | +| DELETE | 支持 | 支持 | 不支持 | +| sequence 列 | 支持 | 支持 | 不支持 | +| delete_sign | 支持 | 支持 | 不支持 | +| 部分列更新 | 支持 | 不支持 | 支持(但无法更新 null 值) | +| 倒排索引 | 支持 | 不支持 | 不支持 | ## 主键(Unique)模型的更新 -Doris 主键 (unique) 模型,从 Doris 2.0 开始,除了原来的 Merge-on-Read(MoR),也引入了 Merge-on-Write(MoW)的存储方式,MoR 是为了写入做优化,而 MoW 是为了更快的分析性能做优化。在实际测试中,MoW 存储方式的典型表,分析性能可以是 MoR 方式的 5-10 倍。 +Doris 主键(unique)模型,从 Doris 2.0 开始,除了原来的 Merge-on-Read(MoR),也引入了 Merge-on-Write(MoW)的存储方式,MoR 是为了写入做优化,而 MoW 是为了更快的分析性能做优化。在实际测试中,MoW 存储方式的典型表,分析性能可以是 MoR 方式的 5-10 倍。 在 Doris 2.0,默认创建的 unique 模型依旧是 MoR 的,如果要创建 MoW 的,需要通过参数 "enable_unique_key_merge_on_write" = "true" 手动指定,如下示例: ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL, + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -58,9 +84,9 @@ PROPERTIES ( ### 主键模型的两种更新方式 -#### 使用`UPDATE`语句更新 +#### 使用 `UPDATE` 语句更新 -无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次UPDATE的耗时会随着被更新的数据量的增加而增长。 +无论是 MoR 还是 MoW,语义都是完成对指定列的更新。单次 UPDATE 的耗时会随着被更新的数据量的增加而增长。 #### 基于导入的批量更新 @@ -72,13 +98,13 @@ Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routin 我们会分别在文档 [主键模型的 Update 更新](../update/unique-update) 和 [主键模型的导入更新](../update/update-of-unique-model) 详细介绍两种更新方式。 -### 主键模型的更新事务 +### 主键模型的更新并发控制 -#### 使用`UPDATE`语句更新数据 +#### 使用 `UPDATE` 语句更新数据 -默认情况下,Doris 不允许在同一时间对同一张表进行多个`UPDATE`操作。`UPDATE`语句通过表级锁来确保事务的一致性。 +默认情况下,Doris 不允许在同一时间对同一张表进行多个 `UPDATE` 操作。`UPDATE` 语句通过表级锁来确保隔离性。 -用户可以通过修改 FE 配置`enable_concurrent_update=true`来调整并发限制。当放宽并发限制时,`UPDATE`语句将不再提供事务保证。 +用户可以通过修改 FE 配置 `enable_concurrent_update=true` 来调整并发限制。当放宽并发限制时,多个 `UPDATE` 语句如果更新同一行数据,结果将是未定义的。 #### 基于导入的批量更新 @@ -88,32 +114,18 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据 由于多个并发导入更新的提交顺序可能无法预期,若这些并发导入涉及相同主键的更新,则其生效顺序也无法预知,最终的可见结果会因此存在不确定性。为解决此问题,Doris 提供了 sequence 列机制,允许用户在并发导入更新时为每一行数据指定版本,以便明确控制并发更新的结果顺序,实现确定性。 -我们将在文档[主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍 +我们将在文档 [主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍。 ## 聚合(Aggregate)模型的更新 -聚合模型的更新,主要是指的是用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 +聚合模型的更新,主要是指用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 -New Agg Value = Agg Func ( Old Agg Value, New Column Value) +New Agg Value = Agg Func (Old Agg Value, New Column Value) -聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 - -## 不同模型/实现的更新能力对比 +聚合模型只支持基于导入方式的更新,不支持使用 `UPDATE` 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 -### 性能对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比MoR约有10%-20%的性能损失,大批量导入(例如千万级/亿级数据)相比MoR约有30%-50%的性能损失 | 与Duplicate Key接近 | 与Duplicate Key接近 | -| 查询速度 | 与Duplicate Key接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的3-10倍 | 如果聚合函数为REPLACE/REPLACE_IF_NOT_NULL,查询速度与MoR接近 | -| 谓词下推 | 支持 | 不支持 | 不支持 | -| 资源消耗 | - **导入资源消耗**:相比Duplicate Key/Unique Key MoR,约额外消耗约10%-30%的CPU。<br /> - **查询资源消耗**:与Duplicate Key接近,无额外资源消耗。<br /> - **Compaction资源消耗**:相比Duplicate Key,消耗更多内存和CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与Duplicate Key相近,无额外资源消耗。<br /> - **查询资源消耗**:相比Duplicate Key/Unique Key MoW,查询时额外消耗更多的CPU和内存。<br /> - **Compaction资源消耗**:相比Duplicate Key,需更多内存和CPU,具体数值取决于数据特征和数据量。 | 与Unique Key MoR相同 | - -### 功能支持对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE |支持|支持|不支持| -| DELETE |支持|支持|不支持| -| sequence列 |支持|支持|不支持| -| delete_sign |支持|支持|不支持| -| 部分列更新 |支持|不支持| 支持(但无法更新null值) | -| 倒排索引 |支持|不支持|不支持| +## 主键模型和聚合模型的选择建议 +- 大部分有数据更新需求的场景,都建议首选**主键模型**。例如从 TP 数据库 CDC 同步到 Doris,用户画像,人群圈选等。 +- 下面两类场景,建议使用聚合模型: + 1. 部分字段需要做指标聚合,部分字段需要进行更新。 + 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL 聚合函数。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md index 9591eba3313..e8020202167 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-overview.md @@ -24,25 +24,51 @@ specific language governing permissions and limitations under the License. --> -数据更新,主要指针对相同 Key 的数据 Value 列的值的更新,这个更新对于主键模型来说,就是替换,对于聚合模型来说,就是如何完成针对 value 列上的聚合。 +数据更新是指对具有相同 key 的数据记录中的 value 列进行修改。对于不同的数据模型,数据更新的处理方式有所不同: + +- **主键(Unique)模型**:主键模型是专门为数据更新设计的一种数据模型。Doris 支持两种存储方式:Merge-on-Read(MoR)和 Merge-on-Write(MoW)。MoR 优化了写入性能,而 MoW 则提供了更好的分析性能。从 Doris 2.1 版本开始,默认存储方式为 MoW。主键模型支持使用 `UPDATE` 语句进行少量数据更新,也支持通过导入方式进行批量更新。导入方式包括 Stream Load、Broker Load、Routine Load 和 Insert Into 等,所有导入操作都遵循“UPSERT”语义,即如果记录不存在则插入,存在则更新。更新操作支持整行更新和部分列更新,默认为整行更新。 + +- **聚合(Aggregate)模型**:在聚合模型中,数据更新是一种特殊用法。当聚合函数设置为 REPLACE 或 REPLACE_IF_NOT_NULL 时,可以实现数据更新。聚合模型仅支持基于导入方式的更新,不支持使用 `UPDATE` 语句。通过设置聚合函数为 REPLACE_IF_NOT_NULL,可以实现部分列更新的能力。 + +通过对不同模型的数据更新方式的理解,可以更好地选择适合的更新策略,以满足具体的业务需求。 + +## 不同模型/实现的更新能力对比 + +### 性能对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比 MoR 约有 10%-20% 的性能损失,大批量导入(例如千万级/亿级数据)相比 MoR 约有 30%-50% 的性能损失 | 与 Duplicate Key 接近 | 与 Duplicate Key 接近 | +| 查询速度 | 与 Duplicate Key 接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的 3-10 倍 | 如果聚合函数为 REPLACE/REPLACE_IF_NOT_NULL,查询速度与 MoR 接近 | +| 谓词下推 | 支持 | 不支持 | 不支持 | +| 资源消耗 | - **导入资源消耗**:相比 Duplicate Key/Unique Key MoR,约额外消耗 10%-30% 的 CPU。<br /> - **查询资源消耗**:与 Duplicate Key 接近,无额外资源消耗。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,消耗更多内存和 CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与 Duplicate Key 相近,无额外资源消耗。<br /> - **查询资源消耗**:相比 Duplicate Key/Unique Key MoW,查询时额外消耗更多的 CPU 和内存。<br /> - **Compaction 资源消耗**:相比 Duplicate Key,需更多内存和 CPU,具体数值取决于数据特征和数据量。 | 与 Unique Key MoR 相同 | + +### 功能支持对比 +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|----------------| +| UPDATE | 支持 | 支持 | 不支持 | +| DELETE | 支持 | 支持 | 不支持 | +| sequence 列 | 支持 | 支持 | 不支持 | +| delete_sign | 支持 | 支持 | 不支持 | +| 部分列更新 | 支持 | 不支持 | 支持(但无法更新 null 值) | +| 倒排索引 | 支持 | 不支持 | 不支持 | ## 主键(Unique)模型的更新 -Doris 主键 (unique) 模型,从 Doris 2.0 开始,除了原来的 Merge-on-Read(MoR),也引入了 Merge-on-Write(MoW)的存储方式,MoR 是为了写入做优化,而 MoW 是为了更快的分析性能做优化。在实际测试中,MoW 存储方式的典型表,分析性能可以是 MoR 方式的 5-10 倍。 +Doris 主键(unique)模型,从 Doris 2.0 开始,除了原来的 Merge-on-Read(MoR),也引入了 Merge-on-Write(MoW)的存储方式,MoR 是为了写入做优化,而 MoW 是为了更快的分析性能做优化。在实际测试中,MoW 存储方式的典型表,分析性能可以是 MoR 方式的 5-10 倍。 在 Doris 2.0,默认创建的 unique 模型依旧是 MoR 的,如果要创建 MoW 的,需要通过参数 "enable_unique_key_merge_on_write" = "true" 手动指定,如下示例: ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -58,7 +84,7 @@ PROPERTIES ( ### 主键模型的两种更新方式 -#### 使用`UPDATE`语句更新 +#### 使用 `UPDATE` 语句更新 无论是 MoR 还是 MoW,语义都是完成对指定列的更新。这个适合少量数据,不频繁的更新。 @@ -72,13 +98,13 @@ Doris 支持多种数据导入方式,包括 Stream Load、Broker Load、Routin 我们会分别在文档 [主键模型的 Update 更新](../update/unique-update) 和 [主键模型的导入更新](../update/update-of-unique-model) 详细介绍两种更新方式。 -### 主键模型的更新事务 +### 主键模型的更新并发控制 -#### 使用`UPDATE`语句更新数据 +#### 使用 `UPDATE` 语句更新数据 -默认情况下,Doris 不允许在同一时间对同一张表进行多个`UPDATE`操作。`UPDATE`语句通过表级锁来确保事务的一致性。 +默认情况下,Doris 不允许在同一时间对同一张表进行多个 `UPDATE` 操作。`UPDATE` 语句通过表级锁来确保隔离性。 -用户可以通过修改 FE 配置`enable_concurrent_update=true`来调整并发限制。当放宽并发限制时,`UPDATE`语句将不再提供事务保证。 +用户可以通过修改 FE 配置 `enable_concurrent_update=true` 来调整并发限制。当放宽并发限制时,多个 `UPDATE` 语句如果更新同一行数据,结果将是未定义的。 #### 基于导入的批量更新 @@ -88,32 +114,18 @@ Doris 对所有导入更新操作提供原子性保障,即每次导入数据 由于多个并发导入更新的提交顺序可能无法预期,若这些并发导入涉及相同主键的更新,则其生效顺序也无法预知,最终的可见结果会因此存在不确定性。为解决此问题,Doris 提供了 sequence 列机制,允许用户在并发导入更新时为每一行数据指定版本,以便明确控制并发更新的结果顺序,实现确定性。 -我们将在文档[主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍 +我们将在文档[主键模型的更新事务](../update/unique-update-transaction.md) 中对事务机制进行详细介绍。 ## 聚合(Aggregate)模型的更新 -聚合模型的更新,主要是指的是用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 +聚合模型的更新,主要是指用新的列值和旧的聚合值按照聚合函数的要求产出新的聚合值。 -New Agg Value = Agg Func ( Old Agg Value, New Column Value) +New Agg Value = Agg Func (Old Agg Value, New Column Value) -聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 - -## 不同模型/实现的更新能力对比 +聚合模型只支持基于导入方式的更新,不支持使用 Update 语句更新。在定义聚合模型表的时候,如果把 value 列的聚合函数定义为 REPLACE_IF_NOT_NULL,也可以间接实现类似主键表的部分列更新能力。更多内容,请查看 [聚合模型的导入更新](../update/update-of-aggregate-model)。 -### 性能对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| 导入速度 | 导入过程中进行数据去重,小批量实时写入相比MoR约有10%-20%的性能损失,大批量导入(例如千万级/亿级数据)相比MoR约有30%-50%的性能损失 | 与Duplicate Key接近 | 与Duplicate Key接近 | -| 查询速度 | 与Duplicate Key接近 | 需要在查询期间进行去重,查询耗时约为 MoW 的3-10倍 | 如果聚合函数为REPLACE/REPLACE_IF_NOT_NULL,查询速度与MoR接近 | -| 谓词下推 | 支持 | 不支持 | 不支持 | -| 资源消耗 | - **导入资源消耗**:相比Duplicate Key/Unique Key MoR,约额外消耗约10%-30%的CPU。<br /> - **查询资源消耗**:与Duplicate Key接近,无额外资源消耗。<br /> - **Compaction资源消耗**:相比Duplicate Key,消耗更多内存和CPU,具体取决于数据特征和数据量。 | - **导入资源消耗**:与Duplicate Key相近,无额外资源消耗。<br /> - **查询资源消耗**:相比Duplicate Key/Unique Key MoW,查询时额外消耗更多的CPU和内存。<br /> - **Compaction资源消耗**:相比Duplicate Key,需更多内存和CPU,具体数值取决于数据特征和数据量。 | 与Unique Key MoR相同 | - -### 功能支持对比 -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE |支持|支持|不支持| -| DELETE |支持|支持|不支持| -| sequence列 |支持|支持|不支持| -| delete_sign |支持|支持|不支持| -| 部分列更新 |支持|不支持| 支持(但无法更新null值) | -| 倒排索引 |支持|不支持|不支持| +## 主键模型和聚合模型的选择建议 +- 大部分有数据更新需求的场景,都建议首选**主键模型**。例如从TP数据库CDC同步到Doris,用户画像,人群圈选等。 +- 下面两类场景,建议使用聚合模型: + 1. 部分字段需要做指标聚合,部分字段需要进行更新。 + 2. 对部分列更新有需求,同时对写入性能非常敏感,对查询延迟要求不高的场景,建议使用聚合表 + REPLACE_IF_NOT_NULL聚合函数。 \ No newline at end of file diff --git a/versioned_docs/version-2.0/data-operate/update/update-overview.md b/versioned_docs/version-2.0/data-operate/update/update-overview.md index a6fd54f2856..9c82424d661 100644 --- a/versioned_docs/version-2.0/data-operate/update/update-overview.md +++ b/versioned_docs/version-2.0/data-operate/update/update-overview.md @@ -24,7 +24,33 @@ specific language governing permissions and limitations under the License. --> -Data update primarily refers to the modification of the value column of data with the same key. For the primary(unique) key model, this update involves replacing the existing value, while for the aggregate model, it involves aggregating the values in the value column. +Data update refers to modifying the value columns in data records with the same key. The handling of data updates varies for different data models: + +- **Primary Key (Unique) Model**: The primary key model is specifically designed for data updates. Doris supports two storage implementations: Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance, while MoW provides better analytical performance. In Doris version 2.0, the default storage method is MoR. The primary key model supports using the `UPDATE` statement for small data updates and also supports batch updates through data loading. Loading methods include St [...] + +- **Aggregate Model**: In the aggregate model, data update is a special use case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data updates can be achieved. The aggregate model only supports updates based on data loading and does not support using the `UPDATE` statement. By setting the aggregate function to REPLACE_IF_NOT_NOT_NULL, partial column update capability can be achieved. + +By understanding the data update methods of different models, you can better choose the appropriate update strategy to meet specific business needs. + +## Comparison of Update Capabilities for Different Models/Implementations + +### Performance Comparison +| | Unique Key MoW | Unique Key MoR [...] +|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------ [...] +| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key [...] +| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of M [...] +| Predicate Pushdown | Supported | Not Supported [...] +| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] + +### Feature Support Comparison +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|------------------------------------| +| UPDATE | Supported | Supported | Not Supported | +| DELETE | Supported | Supported | Not Supported | +| sequence column| Supported | Supported | Not Supported | +| delete_sign | Supported | Supported | Not Supported | +| Partial Column Updates | Supported | Not Supported | Supported (can't update null value) | +| Inverted Index | Supported | Not Supported | Not Supported | ## Update in Primary Key (Unique) Model @@ -35,14 +61,14 @@ By default, in Doris 2.0, the unique key model is still based on MoR. To create ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -72,13 +98,13 @@ Doris supports multiple data load methods, including Stream Load, Broker Load, R We will provide detailed explanations of these two update methods in the documentation: [Update in Unique Key Model](../update/unique-update) and [Load Update in Unique Key Model](../update/update-of-unique-model). -### Update Transactions in Unique Key Model +### Concurrency Control for Primary Key Model Updates -#### Updating Data Using the `UPDATE` Statement +#### Using `UPDATE` Statements to Update Data -By default, Doris does not allow multiple `UPDATE` operations on the same table to occur concurrently. The `UPDATE` statement uses table-level locking to ensure transactional consistency. +By default, Doris does not allow multiple `UPDATE` operations on the same table at the same time. The `UPDATE` statement ensures isolation through table-level locking. -Users can adjust concurrency limits by modifying the FE configuration `enable_concurrent_update=true`. When concurrency limits are relaxed, the `UPDATE` statement will no longer provide transactional guarantees. +Users can adjust the concurrency limit by modifying the FE configuration `enable_concurrent_update=true`. When the concurrency limit is relaxed, if multiple `UPDATE` statements update the same row of data, the result will be undefined. #### Batch Updates Based on Load @@ -94,26 +120,12 @@ For more detailed information on transaction mechanisms, refer to the documentat The update in the aggregate model refers to the process of generating new aggregate values by combining new column values with existing aggregate values, according to the requirements of the aggregate functions. -New Agg Value = Agg Func ( Old Agg Value, New Column Value) - -The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). - -## Comparison of Update Capabilities for Different Models/Implementations +New Agg Value = Agg Func (Old Agg Value, New Column Value) -### Performance Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key | Similar to Duplicate Key | -| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of MoW | If the aggregation function is REPLACE/REPLACE_IF_NOT_NULL, query speed is similar to MoR | -| Predicate Pushdown | Supported | Not Supported | Not Supported | -| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] +The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). -### Feature Support Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE | Supported | Supported | Not Supported | -| DELETE | Supported | Supported | Not Supported | -| sequence column| Supported | Supported | Not Supported | -| delete_sign | Supported | Supported | Not Supported | -| Partial Column Updates | Supported | Not Supported | Supported(can't update null value) | -| Inverted Index | Supported | Not Supported | Not Supported | +## Recommendations for Choosing Between Primary Key and Aggregate Models +- For most scenarios that require data updates, it is recommended to **prefer the primary key model**. Examples include synchronizing from TP databases to Doris via CDC, user profiling, and audience targeting. +- The following scenarios are recommended to use the aggregate model: + 1. Some fields need to be aggregated as metrics, while others need to be updated. + 2. Scenarios where partial column updates are needed, while being very sensitive to write performance and having low requirements for query latency, it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL aggregate function. \ No newline at end of file diff --git a/versioned_docs/version-2.1/data-operate/update/update-overview.md b/versioned_docs/version-2.1/data-operate/update/update-overview.md index a6fd54f2856..f2a861bed5d 100644 --- a/versioned_docs/version-2.1/data-operate/update/update-overview.md +++ b/versioned_docs/version-2.1/data-operate/update/update-overview.md @@ -24,7 +24,33 @@ specific language governing permissions and limitations under the License. --> -Data update primarily refers to the modification of the value column of data with the same key. For the primary(unique) key model, this update involves replacing the existing value, while for the aggregate model, it involves aggregating the values in the value column. +Data update refers to modifying the value columns in data records with the same key. The handling of data updates varies for different data models: + +- **Primary Key (Unique) Model**: The primary key model is specifically designed for data updates. Doris supports two storage implementations: Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance, while MoW provides better analytical performance. From Doris version 2.1, the default storage method is MoW. The primary key model supports using the `UPDATE` statement for small data updates and also supports batch updates through data loading. Loading methods include [...] + +- **Aggregate Model**: In the aggregate model, data update is a special use case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data updates can be achieved. The aggregate model only supports updates based on data loading and does not support using the `UPDATE` statement. By setting the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability can be achieved. + +By understanding the data update methods of different models, you can better choose the appropriate update strategy to meet specific business needs. + +## Comparison of Update Capabilities for Different Models/Implementations + +### Performance Comparison +| | Unique Key MoW | Unique Key MoR [...] +|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------ [...] +| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key [...] +| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of M [...] +| Predicate Pushdown | Supported | Not Supported [...] +| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] + +### Feature Support Comparison +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|------------------------------------| +| UPDATE | Supported | Supported | Not Supported | +| DELETE | Supported | Supported | Not Supported | +| sequence column| Supported | Supported | Not Supported | +| delete_sign | Supported | Supported | Not Supported | +| Partial Column Updates | Supported | Not Supported | Supported (can't update null value) | +| Inverted Index | Supported | Not Supported | Not Supported | ## Update in Primary Key (Unique) Model @@ -35,14 +61,14 @@ By default, in Doris 2.0, the unique key model is still based on MoR. To create ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -72,13 +98,13 @@ Doris supports multiple data load methods, including Stream Load, Broker Load, R We will provide detailed explanations of these two update methods in the documentation: [Update in Unique Key Model](../update/unique-update) and [Load Update in Unique Key Model](../update/update-of-unique-model). -### Update Transactions in Unique Key Model +### Concurrency Control for Primary Key Model Updates -#### Updating Data Using the `UPDATE` Statement +#### Using `UPDATE` Statements to Update Data -By default, Doris does not allow multiple `UPDATE` operations on the same table to occur concurrently. The `UPDATE` statement uses table-level locking to ensure transactional consistency. +By default, Doris does not allow multiple `UPDATE` operations on the same table at the same time. The `UPDATE` statement ensures isolation through table-level locking. -Users can adjust concurrency limits by modifying the FE configuration `enable_concurrent_update=true`. When concurrency limits are relaxed, the `UPDATE` statement will no longer provide transactional guarantees. +Users can adjust the concurrency limit by modifying the FE configuration `enable_concurrent_update=true`. When the concurrency limit is relaxed, if multiple `UPDATE` statements update the same row of data, the result will be undefined. #### Batch Updates Based on Load @@ -94,26 +120,12 @@ For more detailed information on transaction mechanisms, refer to the documentat The update in the aggregate model refers to the process of generating new aggregate values by combining new column values with existing aggregate values, according to the requirements of the aggregate functions. -New Agg Value = Agg Func ( Old Agg Value, New Column Value) - -The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). - -## Comparison of Update Capabilities for Different Models/Implementations +New Agg Value = Agg Func (Old Agg Value, New Column Value) -### Performance Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key | Similar to Duplicate Key | -| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of MoW | If the aggregation function is REPLACE/REPLACE_IF_NOT_NULL, query speed is similar to MoR | -| Predicate Pushdown | Supported | Not Supported | Not Supported | -| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] +The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). -### Feature Support Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE | Supported | Supported | Not Supported | -| DELETE | Supported | Supported | Not Supported | -| sequence column| Supported | Supported | Not Supported | -| delete_sign | Supported | Supported | Not Supported | -| Partial Column Updates | Supported | Not Supported | Supported(can't update null value) | -| Inverted Index | Supported | Not Supported | Not Supported | +## Recommendations for Choosing Between Primary Key and Aggregate Models +- For most scenarios that require data updates, it is recommended to **prefer the primary key model**. Examples include synchronizing from TP databases to Doris via CDC, user profiling, and audience targeting. +- The following scenarios are recommended to use the aggregate model: + 1. Some fields need to be aggregated as metrics, while others need to be updated. + 2. Scenarios where partial column updates are needed, while being very sensitive to write performance and having low requirements for query latency, it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL aggregate function. \ No newline at end of file diff --git a/versioned_docs/version-3.0/data-operate/update/update-overview.md b/versioned_docs/version-3.0/data-operate/update/update-overview.md index a6fd54f2856..77a2d7f7382 100644 --- a/versioned_docs/version-3.0/data-operate/update/update-overview.md +++ b/versioned_docs/version-3.0/data-operate/update/update-overview.md @@ -24,7 +24,33 @@ specific language governing permissions and limitations under the License. --> -Data update primarily refers to the modification of the value column of data with the same key. For the primary(unique) key model, this update involves replacing the existing value, while for the aggregate model, it involves aggregating the values in the value column. +Data update refers to modifying the value columns in data records with the same key. The handling of data updates varies for different data models: + +- **Primary Key (Unique) Model**: The primary key model is specifically designed for data updates. Doris supports two storage implementations: Merge-on-Read (MoR) and Merge-on-Write (MoW). MoR optimizes write performance, while MoW provides better analytical performance. From Doris version 2.1, the default storage method is MoW. The primary key model supports using the `UPDATE` statement for small data updates and also supports batch updates through data loading. Loading methods include [...] + +- **Aggregate Model**: In the aggregate model, data update is a special use case. When the aggregate function is set to REPLACE or REPLACE_IF_NOT_NULL, data updates can be achieved. The aggregate model only supports updates based on data loading and does not support using the `UPDATE` statement. By setting the aggregate function to REPLACE_IF_NOT_NULL, partial column update capability can be achieved. + +By understanding the data update methods of different models, you can better choose the appropriate update strategy to meet specific business needs. + +## Comparison of Update Capabilities for Different Models/Implementations + +### Performance Comparison +| | Unique Key MoW | Unique Key MoR [...] +|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------ [...] +| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key [...] +| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of M [...] +| Predicate Pushdown | Supported | Not Supported [...] +| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] + +### Feature Support Comparison +| | Unique Key MoW | Unique Key MoR | Aggregate Key | +|----------------|----------------|----------------|------------------------------------| +| UPDATE | Supported | Supported | Not Supported | +| DELETE | Supported | Supported | Not Supported | +| sequence column| Supported | Supported | Not Supported | +| delete_sign | Supported | Supported | Not Supported | +| Partial Column Updates | Supported | Not Supported | Supported (can't update null value) | +| Inverted Index | Supported | Not Supported | Not Supported | ## Update in Primary Key (Unique) Model @@ -35,14 +61,14 @@ By default, in Doris 2.0, the unique key model is still based on MoR. To create ```sql CREATE TABLE IF NOT EXISTS example_tbl_unique_merge_on_write ( - `user_id` LARGEINT NOT NULL, - `username` VARCHAR(50) NOT NULL , - `city` VARCHAR(20), - `age` SMALLINT, - `sex` TINYINT, - `phone` LARGEINT, - `address` VARCHAR(500), - `register_time` DATETIME + `user_id` LARGEINT NOT NULL, + `username` VARCHAR(50) NOT NULL , + `city` VARCHAR(20), + `age` SMALLINT, + `sex` TINYINT, + `phone` LARGEINT, + `address` VARCHAR(500), + `register_time` DATETIME ) UNIQUE KEY(`user_id`, `username`) DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 @@ -53,7 +79,7 @@ PROPERTIES ( ``` :::caution -Starting from Doris 2.1, MoW is the default mode for the unique key model. So, if you are using Doris 2.1 or higher version, make sure to read the relevant table creation documentation. +Starting from Doris 2.1, MoW is the default mode for the unique key model. So, if you are using Doris 2.1 or a higher version, make sure to read the relevant table creation documentation. ::: ### Two Update Methods in Unique Key Model @@ -72,13 +98,13 @@ Doris supports multiple data load methods, including Stream Load, Broker Load, R We will provide detailed explanations of these two update methods in the documentation: [Update in Unique Key Model](../update/unique-update) and [Load Update in Unique Key Model](../update/update-of-unique-model). -### Update Transactions in Unique Key Model +### Concurrency Control for Primary Key Model Updates -#### Updating Data Using the `UPDATE` Statement +#### Using `UPDATE` Statements to Update Data -By default, Doris does not allow multiple `UPDATE` operations on the same table to occur concurrently. The `UPDATE` statement uses table-level locking to ensure transactional consistency. +By default, Doris does not allow multiple `UPDATE` operations on the same table at the same time. The `UPDATE` statement ensures isolation through table-level locking. -Users can adjust concurrency limits by modifying the FE configuration `enable_concurrent_update=true`. When concurrency limits are relaxed, the `UPDATE` statement will no longer provide transactional guarantees. +Users can adjust the concurrency limit by modifying the FE configuration `enable_concurrent_update=true`. When the concurrency limit is relaxed, if multiple `UPDATE` statements update the same row of data, the result will be undefined. #### Batch Updates Based on Load @@ -94,26 +120,12 @@ For more detailed information on transaction mechanisms, refer to the documentat The update in the aggregate model refers to the process of generating new aggregate values by combining new column values with existing aggregate values, according to the requirements of the aggregate functions. -New Agg Value = Agg Func ( Old Agg Value, New Column Value) - -The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). - -## Comparison of Update Capabilities for Different Models/Implementations +New Agg Value = Agg Func (Old Agg Value, New Column Value) -### Performance Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|---------------| -| Import Speed | Deduplication is performed during import. Small-batch real-time writes incur approximately 10%-20% performance loss compared to MoR, while large-batch imports (e.g., tens or hundreds of millions of records) have about 30%-50% performance loss compared to MoR. | Similar to Duplicate Key | Similar to Duplicate Key | -| Query Speed | Similar to Duplicate Key | Requires deduplication during queries, with query time approximately 3-10 times that of MoW | If the aggregation function is REPLACE/REPLACE_IF_NOT_NULL, query speed is similar to MoR | -| Predicate Pushdown | Supported | Not Supported | Not Supported | -| Resource Consumption | - **Import Resource Consumption**: Consumes approximately 10%-30% more CPU compared to Duplicate Key/Unique Key MoR.<br /> - **Query Resource Consumption**: Similar to Duplicate Key with no additional resource consumption.<br /> - **Compaction Resource Consumption**: Higher memory and CPU usage compared to Duplicate Key, specific usage depends on data characteristics and volume. | - **Import Resource Consumption**: Similar to Duplicate Key with no additional reso [...] +The update in the aggregate model is only supported through load methods and does not support the use of Update statements. When defining a table in the aggregate model, if the aggregation function for the value column is defined as REPLACE_IF_NOT_NULL, it indirectly achieves partial column update capabilities similar to the unique key model. For more details, please refer to the documentation on [Load Update in the Aggregate Model](../update/update-of-aggregate-model). -### Feature Support Comparison -| | Unique Key MoW | Unique Key MoR | Aggregate Key | -|----------------|----------------|----------------|---------------| -| UPDATE | Supported | Supported | Not Supported | -| DELETE | Supported | Supported | Not Supported | -| sequence column| Supported | Supported | Not Supported | -| delete_sign | Supported | Supported | Not Supported | -| Partial Column Updates | Supported | Not Supported | Supported(can't update null value) | -| Inverted Index | Supported | Not Supported | Not Supported | +## Recommendations for Choosing Between Primary Key and Aggregate Models +- For most scenarios that require data updates, it is recommended to **prefer the primary key model**. Examples include synchronizing from TP databases to Doris via CDC, user profiling, and audience targeting. +- The following scenarios are recommended to use the aggregate model: + 1. Some fields need to be aggregated as metrics, while others need to be updated. + 2. Scenarios where partial column updates are needed, while being very sensitive to write performance and having low requirements for query latency, it is recommended to use the aggregate table with the REPLACE_IF_NOT_NULL aggregate function. \ No newline at end of file --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org