This is an automated email from the ASF dual-hosted git repository.

zhangchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 99509aa092 [doc](update) address comments update-of-unique-model.md (#1761) 99509aa092 is described below commit 99509aa09242734108132813f25c274f040c44f2 Author: zhannngchen <zhangc...@selectdb.com> AuthorDate: Tue Jan 14 14:07:20 2025 +0800 [doc](update) address comments update-of-unique-model.md (#1761) ## Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [x] Checked by AI - [x] Test Cases Built --- docs/data-operate/update/update-of-unique-model.md | 121 ++++++++++----------- .../data-operate/update/update-of-unique-model.md | 89 ++++++++++----- .../data-operate/update/update-of-unique-model.md | 101 ++++++++--------- .../data-operate/update/update-of-unique-model.md | 102 ++++++++--------- .../data-operate/update/update-of-unique-model.md | 117 +++++++++----------- .../data-operate/update/update-of-unique-model.md | 117 +++++++++----------- 6 files changed, 313 insertions(+), 334 deletions(-) diff --git a/docs/data-operate/update/update-of-unique-model.md b/docs/data-operate/update/update-of-unique-model.md index 8e52ad776e..954585bd77 100644 --- a/docs/data-operate/update/update-of-unique-model.md +++ b/docs/data-operate/update/update-of-unique-model.md @@ -24,130 +24,119 @@ specific language governing permissions and limitations under the License. --> -This document primarily introduces the updates based on the load data on the Doris Unique Key model. +This document introduces how to update data in the Doris primary key model using various load methods. -## Full Row Update +## Whole Row Update -When loading data into the primary key model (Unique model) in Doris using supported load methods such as Stream Load, Broker Load, Routine Load, Insert Into, etc., if there are no corresponding data rows with the primary key, new data will be inserted. If there are corresponding data rows with the primary key, the data will be updated. 
In other words, loading data into the Doris primary key model follows an "upsert" mode. Based on the import, updating existing records is by default the [...] +When loading data into the primary key model (Unique model) using Doris-supported methods like Stream Load, Broker Load, Routine Load, Insert Into, etc., new data is inserted if there is no existing primary key data row. If there is an existing primary key data row, it is updated. This means the load operation in the Doris primary key model works in an "upsert" mode. The process of updating existing records is the same as loading new records by default, so you can refer to the data load [...] -## Partial Update +## Partial Column Update -Updating partial columns mainly refers to directly updating certain field values in a table instead of updating all field values. This can be done using the Update statement, which typically involves reading the entire row data, updating specific field values, and then writing it back. This read-write transaction is time-consuming and not suitable for writing large amounts of data. In the context of load updates on the primary key model, Doris provides a functionality to directly insert [...] +Partial column update allows you to update specific fields in a table without modifying all fields. You can use the Update statement to perform this operation, which typically involves reading the entire row, updating the desired fields, and writing it back. This read-write transaction is time-consuming and not suitable for large-scale data writing. Doris provides a feature to directly insert or update partial column data in the primary key model load update, bypassing the need to read t [...] -:::caution Note: - -1. Partial updates are only supported in the Merge-on-Write implementation of the Unique Key starting from version 2.0. -2. Starting from version 2.0.2, partial updates are supported using INSERT INTO. -3. Partial updates are not supported on tables with materialized views. 
-::: +:::caution Note -### Use Cases +1. Version 2.0 only supports partial column updates in the Merge-on-Write implementation of the Unique Key. +2. Starting from version 2.0.2, partial column updates are supported using INSERT INTO. +3. Partial column updates are not supported on tables with synchronized materialized views. -- Real-time dynamic column updates that require high-frequency updates on certain fields in the table. For example, in a user tag table, there are fields containing the latest user behavior information that needs real-time updates to enable real-time analysis and decision-making in advertising/recommendation systems. +::: -- Combining multiple source tables into a large denormalized table. +### Applicable Scenarios +- Real-time dynamic column updates, requiring frequent updates of specific fields in the table. For example, updating fields related to the latest user behavior in a user tag table for real-time analysis and decision-making in advertising/recommendation systems. +- Merging multiple source tables into one large wide table. - Data correction. -### Usage +### Usage Example -#### Table Creation +Assume there is an order table `order_tbl` in Doris, where the order id is the Key column, and the order status and order amount are the Value columns. The data status is as follows: -When creating the table, the following property needs to be specified to enable the Merge-on-Write implementation: +| Order id | Order Amount | Order Status | +| -------- | -------------| -------------| +| 1 | 100 | Pending Payment | ```sql -enable_unique_key_merge_on_write = true ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | Pending Payment | ++----------+--------------+--------------+ +1 row in set (0.01 sec) ``` -#### Load +After the user clicks to pay, the Doris system needs to change the order status of the order with order id '1' to 'Pending Shipment'. 
+ +#### Partial Column Update Using Load Methods **StreamLoad/BrokerLoad/RoutineLoad** -If you are using Stream Load/Broker Load/Routine Load, add the following header during the load: +Prepare the following CSV file: -```sql -partial_columns: true +``` +1,Pending Shipment ``` -Also, specify the columns to be loaded in the `columns` section (all key columns must be included, otherwise updates won't be possible). Below is an example of Stream Load: +Add the following header during load: ```sql -$ curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load +partial_columns:true ``` -**INSERT INTO** - -In all data models, when using `INSERT INTO` with a subset of columns, the default behavior is to insert the entire row. To enable partial column updates in the Merge-on-Write implementation, the following session variable needs to be set: +Specify the columns to be loaded in `columns` (must include all key columns). Below is an example of Stream Load: ```sql -SET enable_unique_key_partial_update=true -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped'); +curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load ``` -Note that the default value for the session variable `enable_insert_strict`, which controls whether the insert statement operates in strict mode, is true. In strict mode, updating non-existing keys during partial column updates is not allowed. So, if you want to insert non-existing keys during partial column updates using the insert statement, you need to set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false. 
- -**Flink Connector** +**INSERT INTO** -If you are using the Flink Connector, add the following configuration: +In all data models, the default behavior of `INSERT INTO` when given partial columns is to write the entire row. To prevent misuse, in the Merge-on-Write implementation, `INSERT INTO` maintains the semantics of whole row UPSERT by default. To enable partial column updates, set the following session variable: ```sql -'sink.properties.partial_columns' = 'true', +SET enable_unique_key_partial_update=true; +INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment'); ``` -Also, specify the columns to be loaded in `sink.properties.column` (all key columns must be included, otherwise updates won't be possible). - -### Example +Note that the session variable `enable_insert_strict` defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. To insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and `enable_insert_strict` to false. -Suppose there is an order table named `order_tbl` in Doris, where the order ID is a key column, and the order status and order amount are value columns. The data is as follows: +**Flink Connector** -| Order ID | Order Amount | Order Status | -| -------- | ------------ | ------------ | -| 1 | 100 | Pending | +If using Flink Connector, add the following configuration: ```sql -+----------+--------------+--------------+ -| order_id | order_amount | order_status | -+----------+--------------+--------------+ -| 1 | 100 | Pending | -+----------+--------------+--------------+ -1 row in set (0.01 sec) +'sink.properties.partial_columns' = 'true', ``` -Now, when a user clicks on payment, the Doris system needs to update the order status of the order with ID '1' to 'To be shipped'. - -We use `INSERT INTO` to update the data. 
+Specify the columns to be loaded in `sink.properties.column` (must include all key columns). -```sql -SET enable_unique_key_partial_update=true; -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped'); -``` +#### Update Result -After the update, the result is as follows: +The result after the update is as follows: ```sql +----------+--------------+--------------+ | order_id | order_amount | order_status | +----------+--------------+--------------+ -| 1 | 100 | To be shipped | +| 1 | 100 | Pending Shipment | +----------+--------------+--------------+ 1 row in set (0.01 sec) ``` -### Notes - -Due to the Merge-on-Write implementation requiring data completion during data writing to ensure optimal query performance, performing partial column updates using the Merge-on-Write implementation may result in a decrease in load performance. +### Usage Notes -Suggestions for improving load performance: +Since the Merge-on-Write implementation needs to complete the entire row of data during writing to ensure optimal query performance, using it for partial column updates may decrease partial load performance. -- Use SSDs equipped with NVMe or high-speed SSD cloud disks. Reading historical data in large quantities during data completion will generate high read IOPS and read throughput. +Performance optimization suggestions: -- Enabling row storage can significantly reduce the IOPS generated during data completion, resulting in noticeable improvements in load performance. Users can enable row storage by using the following property when creating the table: +- Use SSDs equipped with NVMe or high-speed SSD cloud disks, as completing data will read a large amount of historical data, generating high read IOPS and throughput. +- Enabling row storage can reduce the IOPS generated when completing data, significantly improving load performance. 
Enable row storage by setting the following property when creating a table: ```Plain "store_row_column" = "true" ``` -Now, all rows in a batch write task (whether it is a load task or `INSERT INTO`) can only update the same columns. If you need to update different columns, you will need to perform separate batch writes. +Currently, all rows in the same batch data writing task (whether a load task or `INSERT INTO`) can only update the same columns. To update data with different columns, write in different batches. ## Flexible Partial Column Updates @@ -335,10 +324,10 @@ The final data in the table is as follows: | k | v1 | v2 | v3 | v4 | v5 | +---+--------+--------+-----+------+--------+ | 0 | 0 | 0 | 0 | 0 | 0 | -| 1 | 1 | 1 | 1 | 1 | 1 | -| 5 | 5 | 5 | 5 | 5 | 5 | +| 1 | 1 | 1 | 1 | 1 | 10 | +| 5 | 5 | 5 | 5 | 5 | 50 | | 2 | 2 | 222 | 2 | 2 | 25 | -| 3 | 3 | 3 | 333 | 3 | 3 | +| 3 | 3 | 3 | 333 | 3 | 30 | | 4 | 411 | <null> | 433 | 444 | 50 | | 6 | 611 | 9876 | 633 | 1234 | <null> | | 7 | <null> | 9876 | 733 | 1234 | 300 | diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-of-unique-model.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-of-unique-model.md index 6a430d0153..f961a5f960 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-of-unique-model.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/update-of-unique-model.md @@ -24,84 +24,119 @@ specific language governing permissions and limitations under the License. 
--> -这篇文档主要介绍 Doris 主键模型上基于导入的更新。 +这篇文档主要介绍 Doris 主键模型基于导入的更新。 ## 整行更新 -使用 Doris 支持的 Stream Load,Broker Load,Routine Load,Insert Into 等导入方式,往主键模型(Unique 模型)中进行数据导入时,如果没有相应主键的数据行,就执行插入新的数据,如果有相应主键的数据行,就进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,所以,这里可以参考数据导入的文档部分。 +使用 Doris 支持的 Stream Load、Broker Load、Routine Load、Insert Into 等导入方式,向主键模型(Unique 模型)导入数据时,如果没有相应主键的数据行,则插入新数据;如果有相应主键的数据行,则进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,因此可以参考数据导入的文档部分。 ## 部分列更新 -部分列更新,主要是指直接更新表中某些字段值,而不是全部的字段值。可以采用 Update 语句来进行更新,这种 Update 语句一般采用先将整行数据读出,然后再更新部分字段值,再写回。这种读写事务非常耗时,并且不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了可以直接插入或者更新部分列数据的功能,不需要先读取整行数据,这样更新效率就大幅提升了。 +部分列更新是指直接更新表中某些字段值,而不是全部字段值。可以使用 Update 语句进行更新,这种 Update 语句通常先读取整行数据,然后更新部分字段值,再写回。这种读写事务非常耗时,不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了直接插入或更新部分列数据的功能,不需要先读取整行数据,从而大幅提升更新效率。 :::caution 注意 -1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持了部分列更新能力 -2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新 -3. 不支持在有同步物化视图的表上进行部分列更新 +1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持部分列更新能力。 +2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新。 +3. 
不支持在有同步物化视图的表上进行部分列更新。 + ::: ### 适用场景 -- 实时的动态列更新,需要在表中实时的高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时的更新,以实现广告/推荐等系统能够据其进行实时的分析和决策。 +- 实时动态列更新,需要在表中实时高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时更新,以便广告/推荐系统能够据此进行实时分析和决策。 +- 将多张源表拼接成一张大宽表。 +- 数据修正。 -- 将多张源表拼接成一张大宽表 +### 使用示例 -- 数据修正 +假设 Doris 中存在一张订单表 order_tbl,其中订单 id 是 Key 列,订单状态和订单金额是 Value 列。数据状态如下: -### 使用方式 +| 订单 id | 订单金额 | 订单状态 | +| ------ | -------- | -------- | +| 1 | 100 | 待付款 | + +```sql ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待付款 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) +``` + +此时,用户点击付款后,Doris 系统需要将订单 id 为 '1' 的订单状态变更为 '待发货'。 + +#### 可以使用以下导入方式进行部分列更新 + +**StreamLoad/BrokerLoad/RoutineLoad** -**StreamLoad/BrokerLoad/RoutineLoad 导入** +准备如下 csv 文件: -如果使用的是 Stream Load/Broker Load/Routine Load,在导入时添加如下 header +``` +1,待发货 +``` + +在导入时添加如下 header: ```sql partial_columns:true ``` -同时在`columns`中指定要导入的列(必须包含所有 key 列,不然无法更新),下面是一个 Stream Load 的例子: +同时在 `columns` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。下面是一个 Stream Load 的例子: ```sql -curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load +curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load ``` -**INSERT INTO 导入** +**INSERT INTO** -在所有的数据模型中,`INSERT INTO` 给定一部分列时默认行为都是整行写入,为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO`默认仍然保持整行 UPSERT 的语义,如果需要开启部分列更新的语义,需要设置如下 session variable +在所有数据模型中,`INSERT INTO` 给定部分列时默认行为是整行写入。为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO` 默认仍然保持整行 UPSERT 的语义。如果需要开启部分列更新的语义,需要设置如下 session variable: ```sql -SET enable_unique_key_partial_update=true -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); +SET enable_unique_key_partial_update=true; +INSERT INTO 
order_tbl (order_id, order_status) VALUES (1, '待发货'); ``` -需要注意的是,控制 insert 语句是否开启严格模式的会话变量`enable_insert_strict`的默认值为 true,即 insert 语句默认开启严格模式,而在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新的时候如果希望能插入不存在的 key,需要在`enable_unique_key_partial_update`设置为 true 的基础上同时将`enable_insert_strict`设置为 false。 +需要注意的是,控制 insert 语句是否开启严格模式的会话变量 `enable_insert_strict` 的默认值为 true,即 insert 语句默认开启严格模式。在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新时,如果希望能插入不存在的 key,需要在 `enable_unique_key_partial_update` 设置为 true 的基础上,同时将 `enable_insert_strict` 设置为 false。 -**Flink Connector 导入** +**Flink Connector** -如果使用 Flink Connector, 需要添加如下配置: +如果使用 Flink Connector,需要添加如下配置: ```sql 'sink.properties.partial_columns' = 'true', ``` -同时在`sink.properties.column`中指定要导入的列(必须包含所有 key 列,不然无法更新) +同时在 `sink.properties.column` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。 + +#### 更新结果 +更新后结果如下: + +```sql ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待发货 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) +``` ### 使用注意 -由于 Merge-on-Write 实现需要在数据写入的时候,进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会有部分导入性能下降。 +由于 Merge-on-Write 实现需要在数据写入时进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会导致部分导入性能下降。 写入性能优化建议: -- 使用配备了 NVMe 的 SSD,或者极速 SSD 云盘。因为补齐数据时会大量的读取历史数据,产生较高的读 IOPS,以及读吞吐 - -- 开启行存将能够大大减少补齐数据时产生的 IOPS,导入性能提升明显,用户可以在建表时通过如下 property 来开启行存: +- 使用配备 NVMe 的 SSD,或者极速 SSD 云盘。因为补齐数据时会大量读取历史数据,产生较高的读 IOPS 以及读吞吐。 +- 开启行存能够大大减少补齐数据时产生的 IOPS,导入性能提升明显。用户可以在建表时通过如下 property 来开启行存: ```Plain "store_row_column" = "true" ``` -目前,同一批次数据写入任务(无论是导入任务还是`INSERT INTO`)的所有行只能更新相同的列,如果需要更新不同列的数据,则需要分不同的批次进行写入。 +目前,同一批次数据写入任务(无论是导入任务还是 `INSERT INTO`)的所有行只能更新相同的列。如果需要更新不同列的数据,则需要分不同批次进行写入。 ## 灵活部分列更新 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-of-unique-model.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-of-unique-model.md index 
e5eb8a858c..9fc8d2d911 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-of-unique-model.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/update-of-unique-model.md @@ -24,105 +24,95 @@ specific language governing permissions and limitations under the License. --> -这篇文档主要介绍 Doris 主键模型上基于导入的更新。 +这篇文档主要介绍 Doris 主键模型基于导入的更新。 ## 整行更新 -使用 Doris 支持的 Stream Load,Broker Load,Routine Load,Insert Into 等导入方式,往主键模型(Unique 模型)中进行数据导入时,如果没有相应主键的数据行,就执行插入新的数据,如果有相应主键的数据行,就进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,所以,这里可以参考数据导入的文档部分。 +使用 Doris 支持的 Stream Load、Broker Load、Routine Load、Insert Into 等导入方式,向主键模型(Unique 模型)导入数据时,如果没有相应主键的数据行,则插入新数据;如果有相应主键的数据行,则进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,因此可以参考数据导入的文档部分。 ## 部分列更新 -部分列更新,主要是指直接更新表中某些字段值,而不是全部的字段值。可以采用 Update 语句来进行更新,这种 Update 语句一般采用先将整行数据读出,然后再更新部分字段值,再写回。这种读写事务非常耗时,并且不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了可以直接插入或者更新部分列数据的功能,不需要先读取整行数据,这样更新效率就大幅提升了。 +部分列更新是指直接更新表中某些字段值,而不是全部字段值。可以使用 Update 语句进行更新,这种 Update 语句通常先读取整行数据,然后更新部分字段值,再写回。这种读写事务非常耗时,不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了直接插入或更新部分列数据的功能,不需要先读取整行数据,从而大幅提升更新效率。 :::caution 注意 -1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持了部分列更新能力 -2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新 -3. 不支持在有同步物化视图的表上进行部分列更新 +1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持部分列更新能力。 +2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新。 +3. 
不支持在有同步物化视图的表上进行部分列更新。 + ::: ### 适用场景 -- 实时的动态列更新,需要在表中实时的高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时的更新,以实现广告/推荐等系统能够据其进行实时的分析和决策。 - -- 将多张源表拼接成一张大宽表 - -- 数据修正 +- 实时动态列更新,需要在表中实时高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时更新,以便广告/推荐系统能够据此进行实时分析和决策。 +- 将多张源表拼接成一张大宽表。 +- 数据修正。 -### 使用方式 +### 使用示例 -#### 建表 +假设 Doris 中存在一张订单表 order_tbl,其中订单 id 是 Key 列,订单状态和订单金额是 Value 列。数据状态如下: -建表时需要指定如下 property,以开启 Merge-on-Write 实现 +| 订单 id | 订单金额 | 订单状态 | +| ------ | -------- | -------- | +| 1 | 100 | 待付款 | ```sql -enable_unique_key_merge_on_write = true ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待付款 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) ``` -#### 导入 +此时,用户点击付款后,Doris 系统需要将订单 id 为 '1' 的订单状态变更为 '待发货'。 + +#### 可以使用以下导入方式进行部分列更新 **StreamLoad/BrokerLoad/RoutineLoad** -如果使用的是 Stream Load/Broker Load/Routine Load,在导入时添加如下 header +准备如下 csv 文件: + +``` +1,待发货 +``` + +在导入时添加如下 header: ```sql partial_columns:true ``` -同时在`columns`中指定要导入的列(必须包含所有 key 列,不然无法更新), 下面是一个 Stream Load 的例子: +同时在 `columns` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。下面是一个 Stream Load 的例子: ```sql -curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load +curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load ``` **INSERT INTO** -在所有的数据模型中,`INSERT INTO` 给定一部分列时默认行为都是整行写入,为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO`默认仍然保持整行 UPSERT 的语义,如果需要开启部分列更新的语义,需要设置如下 session variable +在所有数据模型中,`INSERT INTO` 给定部分列时默认行为是整行写入。为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO` 默认仍然保持整行 UPSERT 的语义。如果需要开启部分列更新的语义,需要设置如下 session variable: ```sql -SET enable_unique_key_partial_update=true -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); +SET 
enable_unique_key_partial_update=true; +INSERT INTO order_tbl (order_id, order_status) VALUES (1, '待发货'); ``` -需要注意的是,控制 insert 语句是否开启严格模式的会话变量`enable_insert_strict`的默认值为 true,即 insert 语句默认开启严格模式,而在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新的时候如果希望能插入不存在的 key,需要在`enable_unique_key_partial_update`设置为 true 的基础上同时将`enable_insert_strict`设置为 false。 +需要注意的是,控制 insert 语句是否开启严格模式的会话变量 `enable_insert_strict` 的默认值为 true,即 insert 语句默认开启严格模式。在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新时,如果希望能插入不存在的 key,需要在 `enable_unique_key_partial_update` 设置为 true 的基础上,同时将 `enable_insert_strict` 设置为 false。 **Flink Connector** -如果使用 Flink Connector, 需要添加如下配置: +如果使用 Flink Connector,需要添加如下配置: ```sql 'sink.properties.partial_columns' = 'true', ``` -同时在`sink.properties.column`中指定要导入的列(必须包含所有 key 列,不然无法更新) +同时在 `sink.properties.column` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。 -### 示例 +#### 更新结果 -假设 Doris 中存在一张订单表 order_tbl,其中订单 id 是 Key 列,订单状态,订单金额是 Value 列。数据状态如下: - -| 订单 id | 订单金额 | 订单状态 | -| ------ | -------- | -------- | -| 1 | 100 | 待付款 | - -```sql -+----------+--------------+--------------+ -| order_id | order_amount | order_status | -+----------+--------------+--------------+ -| 1 | 100 | 待付款 | -+----------+--------------+--------------+ -1 row in set (0.01 sec) -``` - -这时候,用户点击付款后,Doris 系统需要将订单 id 为 '1' 的订单状态变更为 '待发货'。 - -使用`INSERT INTO`进行更新: - -```sql -SET enable_unique_key_partial_update=true; -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); -``` - -更新后结果如下 +更新后结果如下: ```sql +----------+--------------+--------------+ @@ -135,18 +125,17 @@ INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); ### 使用注意 -由于 Merge-on-Write 实现需要在数据写入的时候,进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会有部分导入性能下降。 +由于 Merge-on-Write 实现需要在数据写入时进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会导致部分导入性能下降。 写入性能优化建议: -- 使用配备了 NVMe 的 SSD,或者极速 SSD 云盘。因为补齐数据时会大量的读取历史数据,产生较高的读 IOPS,以及读吞吐 - -- 开启行存将能够大大减少补齐数据时产生的 IOPS,导入性能提升明显,用户可以在建表时通过如下 property 来开启行存: +- 使用配备 NVMe 的 SSD,或者极速 SSD 
云盘。因为补齐数据时会大量读取历史数据,产生较高的读 IOPS 以及读吞吐。 +- 开启行存能够大大减少补齐数据时产生的 IOPS,导入性能提升明显。用户可以在建表时通过如下 property 来开启行存: ```Plain "store_row_column" = "true" ``` -目前,同一批次数据写入任务(无论是导入任务还是`INSERT INTO`)的所有行只能更新相同的列,如果需要更新不同列的数据,则需要分不同的批次进行写入。 +目前,同一批次数据写入任务(无论是导入任务还是 `INSERT INTO`)的所有行只能更新相同的列。如果需要更新不同列的数据,则需要分不同批次进行写入。 在未来版本中,将支持灵活的列更新,用户可以在同一批导入中,每一行更新不同的列。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-of-unique-model.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-of-unique-model.md index ea038774cc..9fc8d2d911 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-of-unique-model.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/update-of-unique-model.md @@ -24,106 +24,95 @@ specific language governing permissions and limitations under the License. --> -这篇文档主要介绍 Doris 主键模型上基于导入的更新。 +这篇文档主要介绍 Doris 主键模型基于导入的更新。 ## 整行更新 -使用 Doris 支持的 Stream Load,Broker Load,Routine Load,Insert Into 等导入方式,往主键模型(Unique 模型)中进行数据导入时,如果没有相应主键的数据行,就执行插入新的数据,如果有相应主键的数据行,就进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,所以,这里可以参考数据导入的文档部分。 +使用 Doris 支持的 Stream Load、Broker Load、Routine Load、Insert Into 等导入方式,向主键模型(Unique 模型)导入数据时,如果没有相应主键的数据行,则插入新数据;如果有相应主键的数据行,则进行更新。也就是说,Doris 主键模型的导入是一种“upsert”模式。基于导入,对已有记录的更新,默认和导入一个新记录是完全一样的,因此可以参考数据导入的文档部分。 ## 部分列更新 -部分列更新,主要是指直接更新表中某些字段值,而不是全部的字段值。可以采用 Update 语句来进行更新,这种 Update 语句一般采用先将整行数据读出,然后再更新部分字段值,再写回。这种读写事务非常耗时,并且不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了可以直接插入或者更新部分列数据的功能,不需要先读取整行数据,这样更新效率就大幅提升了。 +部分列更新是指直接更新表中某些字段值,而不是全部字段值。可以使用 Update 语句进行更新,这种 Update 语句通常先读取整行数据,然后更新部分字段值,再写回。这种读写事务非常耗时,不适合大批量数据写入。Doris 在主键模型的导入更新中,提供了直接插入或更新部分列数据的功能,不需要先读取整行数据,从而大幅提升更新效率。 :::caution 注意 -1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持了部分列更新能力 -2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新 -3. 不支持在有同步物化视图的表上进行部分列更新 +1. 2.0 版本仅在 Unique Key 的 Merge-on-Write 实现中支持部分列更新能力。 +2. 从 2.0.2 版本开始,支持使用 INSERT INTO 进行部分列更新。 +3. 
不支持在有同步物化视图的表上进行部分列更新。 + ::: ### 适用场景 -- 实时的动态列更新,需要在表中实时的高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时的更新,以实现广告/推荐等系统能够据其进行实时的分析和决策。 - -- 将多张源表拼接成一张大宽表 - -- 数据修正 +- 实时动态列更新,需要在表中实时高频更新某些字段值。例如用户标签表中有一些关于用户最新行为信息的字段需要实时更新,以便广告/推荐系统能够据此进行实时分析和决策。 +- 将多张源表拼接成一张大宽表。 +- 数据修正。 -### 使用方式 +### 使用示例 -#### 建表 +假设 Doris 中存在一张订单表 order_tbl,其中订单 id 是 Key 列,订单状态和订单金额是 Value 列。数据状态如下: -建表时需要指定如下 property,以开启 Merge-on-Write 实现 +| 订单 id | 订单金额 | 订单状态 | +| ------ | -------- | -------- | +| 1 | 100 | 待付款 | ```sql -enable_unique_key_merge_on_write = true ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待付款 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) ``` -#### 导入 +此时,用户点击付款后,Doris 系统需要将订单 id 为 '1' 的订单状态变更为 '待发货'。 + +#### 可以使用以下导入方式进行部分列更新 **StreamLoad/BrokerLoad/RoutineLoad** -如果使用的是 Stream Load/Broker Load/Routine Load,在导入时添加如下 header +准备如下 csv 文件: + +``` +1,待发货 +``` + +在导入时添加如下 header: ```sql partial_columns:true ``` -同时在`columns`中指定要导入的列(必须包含所有 key 列,不然无法更新), 下面是一个 Stream Load 的例子: +同时在 `columns` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。下面是一个 Stream Load 的例子: ```sql -curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load +curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load ``` **INSERT INTO** -在所有的数据模型中,`INSERT INTO` 给定一部分列时默认行为都是整行写入,为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO`默认仍然保持整行 UPSERT 的语义,如果需要开启部分列更新的语义,需要设置如下 session variable +在所有数据模型中,`INSERT INTO` 给定部分列时默认行为是整行写入。为了防止误用,在 Merge-on-Write 实现中,`INSERT INTO` 默认仍然保持整行 UPSERT 的语义。如果需要开启部分列更新的语义,需要设置如下 session variable: ```sql -SET enable_unique_key_partial_update=true -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); +SET 
enable_unique_key_partial_update=true; +INSERT INTO order_tbl (order_id, order_status) VALUES (1, '待发货'); ``` -需要注意的是,控制 insert 语句是否开启严格模式的会话变量`enable_insert_strict`的默认值为 true,即 insert 语句默认开启严格模式,而在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新的时候如果希望能插入不存在的 key,需要在`enable_unique_key_partial_update`设置为 true 的基础上同时将`enable_insert_strict`设置为 false。 +需要注意的是,控制 insert 语句是否开启严格模式的会话变量 `enable_insert_strict` 的默认值为 true,即 insert 语句默认开启严格模式。在严格模式下进行部分列更新不允许更新不存在的 key。所以,在使用 insert 语句进行部分列更新时,如果希望能插入不存在的 key,需要在 `enable_unique_key_partial_update` 设置为 true 的基础上,同时将 `enable_insert_strict` 设置为 false。 **Flink Connector** -如果使用 Flink Connector, 需要添加如下配置: +如果使用 Flink Connector,需要添加如下配置: ```sql 'sink.properties.partial_columns' = 'true', ``` -同时在`sink.properties.column`中指定要导入的列(必须包含所有 key 列,不然无法更新) - - -### 示例 - -假设 Doris 中存在一张订单表 order_tbl,其中订单 id 是 Key 列,订单状态,订单金额是 Value 列。数据状态如下: - -| 订单 id | 订单金额 | 订单状态 | -| ------ | -------- | -------- | -| 1 | 100 | 待付款 | - -```sql -+----------+--------------+--------------+ -| order_id | order_amount | order_status | -+----------+--------------+--------------+ -| 1 | 100 | 待付款 | -+----------+--------------+--------------+ -1 row in set (0.01 sec) -``` - -这时候,用户点击付款后,Doris 系统需要将订单 id 为 '1' 的订单状态变更为 '待发货'。 +同时在 `sink.properties.column` 中指定要导入的列(必须包含所有 key 列,否则无法更新)。 -使用`INSERT INTO`进行更新: +#### 更新结果 -```sql -SET enable_unique_key_partial_update=true; -INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); -``` - -更新后结果如下 +更新后结果如下: ```sql +----------+--------------+--------------+ @@ -136,18 +125,17 @@ INSERT INTO order_tbl (order_id, order_status) VALUES (1,'待发货'); ### 使用注意 -由于 Merge-on-Write 实现需要在数据写入的时候,进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会有部分导入性能下降。 +由于 Merge-on-Write 实现需要在数据写入时进行整行数据的补齐,以保证最优的查询性能,因此使用 Merge-on-Write 实现进行部分列更新会导致部分导入性能下降。 写入性能优化建议: -- 使用配备了 NVMe 的 SSD,或者极速 SSD 云盘。因为补齐数据时会大量的读取历史数据,产生较高的读 IOPS,以及读吞吐 - -- 开启行存将能够大大减少补齐数据时产生的 IOPS,导入性能提升明显,用户可以在建表时通过如下 property 来开启行存: +- 使用配备 NVMe 的 SSD,或者极速 SSD 
云盘。因为补齐数据时会大量读取历史数据,产生较高的读 IOPS 以及读吞吐。 +- 开启行存能够大大减少补齐数据时产生的 IOPS,导入性能提升明显。用户可以在建表时通过如下 property 来开启行存: ```Plain "store_row_column" = "true" ``` -目前,同一批次数据写入任务(无论是导入任务还是`INSERT INTO`)的所有行只能更新相同的列,如果需要更新不同列的数据,则需要分不同的批次进行写入。 +目前,同一批次数据写入任务(无论是导入任务还是 `INSERT INTO`)的所有行只能更新相同的列。如果需要更新不同列的数据,则需要分不同批次进行写入。 在未来版本中,将支持灵活的列更新,用户可以在同一批导入中,每一行更新不同的列。 diff --git a/versioned_docs/version-2.1/data-operate/update/update-of-unique-model.md b/versioned_docs/version-2.1/data-operate/update/update-of-unique-model.md index d5e8f22d9f..62124f9a7d 100644 --- a/versioned_docs/version-2.1/data-operate/update/update-of-unique-model.md +++ b/versioned_docs/version-2.1/data-operate/update/update-of-unique-model.md @@ -24,129 +24,118 @@ specific language governing permissions and limitations under the License. --> -This document primarily introduces the updates based on the load data on the Doris Unique Key model. +This document introduces how to update data in the Doris primary key model using various load methods. -## Full Row Update +## Whole Row Update -When loading data into the primary key model (Unique model) in Doris using supported load methods such as Stream Load, Broker Load, Routine Load, Insert Into, etc., if there are no corresponding data rows with the primary key, new data will be inserted. If there are corresponding data rows with the primary key, the data will be updated. In other words, loading data into the Doris primary key model follows an "upsert" mode. Based on the import, updating existing records is by default the [...] +When loading data into the primary key model (Unique model) using Doris-supported methods like Stream Load, Broker Load, Routine Load, Insert Into, etc., new data is inserted if there is no existing primary key data row. If there is an existing primary key data row, it is updated. This means the load operation in the Doris primary key model works in an "upsert" mode. 
The process of updating existing records is the same as loading new records by default, so you can refer to the data load [...]

-## Partial Update
+## Partial Column Update

-Updating partial columns mainly refers to directly updating certain field values in a table instead of updating all field values. This can be done using the Update statement, which typically involves reading the entire row data, updating specific field values, and then writing it back. This read-write transaction is time-consuming and not suitable for writing large amounts of data. In the context of load updates on the primary key model, Doris provides a functionality to directly insert [...]
+Partial column update allows you to update specific fields in a table without modifying all fields. You can use the Update statement to perform this operation, which typically involves reading the entire row, updating the desired fields, and writing it back. This read-write transaction is time-consuming and not suitable for large-scale data writing. Doris provides a feature to directly insert or update partial column data in the primary key model load update, bypassing the need to read t [...]

-:::caution Note:
+:::caution Note

-1. Partial updates are only supported in the Merge-on-Write implementation of the Unique Key starting from version 2.0.
-2. Starting from version 2.0.2, partial updates are supported using INSERT INTO.
-3. Partial updates are not supported on tables with materialized views.
-:::
-
-### Use Cases
+1. Version 2.0 only supports partial column updates in the Merge-on-Write implementation of the Unique Key.
+2. Starting from version 2.0.2, partial column updates are supported using INSERT INTO.
+3. Partial column updates are not supported on tables with synchronized materialized views.

-- Real-time dynamic column updates that require high-frequency updates on certain fields in the table. For example, in a user tag table, there are fields containing the latest user behavior information that needs real-time updates to enable real-time analysis and decision-making in advertising/recommendation systems.
+:::

-- Combining multiple source tables into a large denormalized table.
+### Applicable Scenarios
+- Real-time dynamic column updates, requiring frequent updates of specific fields in the table. For example, updating fields related to the latest user behavior in a user tag table for real-time analysis and decision-making in advertising/recommendation systems.
+- Merging multiple source tables into one large wide table.
- Data correction.

-### Usage
+### Usage Example

-#### Table Creation
+Assume there is an order table `order_tbl` in Doris, where the order id is the Key column, and the order status and order amount are the Value columns. The data status is as follows:

-When creating the table, the following property needs to be specified to enable the Merge-on-Write implementation:
+| Order id | Order Amount | Order Status |
+| -------- | -------------| -------------|
+| 1 | 100 | Pending Payment |

```sql
-enable_unique_key_merge_on_write = true
++----------+--------------+--------------+
+| order_id | order_amount | order_status |
++----------+--------------+--------------+
+| 1 | 100 | Pending Payment |
++----------+--------------+--------------+
+1 row in set (0.01 sec)
```

-#### Load
+After the user clicks to pay, the Doris system needs to change the order status of the order with order id '1' to 'Pending Shipment'.
+
+#### Partial Column Update Using Load Methods

**StreamLoad/BrokerLoad/RoutineLoad**

-If you are using Stream Load/Broker Load/Routine Load, add the following header during the load:
+Prepare the following CSV file:

-```sql
-partial_columns: true
+```
+1,Pending Shipment
```

-Also, specify the columns to be loaded in the `columns` section (all key columns must be included, otherwise updates won't be possible). Below is an example of Stream Load:
+Add the following header during load:

```sql
-$ curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load
+partial_columns:true
```

-**INSERT INTO**
-
-In all data models, when using `INSERT INTO` with a subset of columns, the default behavior is to insert the entire row. To enable partial column updates in the Merge-on-Write implementation, the following session variable needs to be set:
+Specify the columns to be loaded in `columns` (must include all key columns). Below is an example of Stream Load:

```sql
-SET enable_unique_key_partial_update=true
-INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped');
+curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load
```

-Note that the default value for the session variable `enable_insert_strict`, which controls whether the insert statement operates in strict mode, is true. In strict mode, updating non-existing keys during partial column updates is not allowed. So, if you want to insert non-existing keys during partial column updates using the insert statement, you need to set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.
-
-**Flink Connector**
+**INSERT INTO**

-If you are using the Flink Connector, add the following configuration:
+In all data models, the default behavior of `INSERT INTO` when given partial columns is to write the entire row. To prevent misuse, in the Merge-on-Write implementation, `INSERT INTO` maintains the semantics of whole row UPSERT by default. To enable partial column updates, set the following session variable:

```sql
-'sink.properties.partial_columns' = 'true',
+SET enable_unique_key_partial_update=true;
+INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

-Also, specify the columns to be loaded in `sink.properties.column` (all key columns must be included, otherwise updates won't be possible).
+Note that the session variable `enable_insert_strict` defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. To insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and `enable_insert_strict` to false.

-### Example
-
-Suppose there is an order table named `order_tbl` in Doris, where the order ID is a key column, and the order status and order amount are value columns. The data is as follows:
+**Flink Connector**

-| Order ID | Order Amount | Order Status |
-| -------- | ------------ | ------------ |
-| 1 | 100 | Pending |
+If using Flink Connector, add the following configuration:

```sql
-+----------+--------------+--------------+
-| order_id | order_amount | order_status |
-+----------+--------------+--------------+
-| 1 | 100 | Pending |
-+----------+--------------+--------------+
-1 row in set (0.01 sec)
+'sink.properties.partial_columns' = 'true',
```

-Now, when a user clicks on payment, the Doris system needs to update the order status of the order with ID '1' to 'To be shipped'.
+Specify the columns to be loaded in `sink.properties.column` (must include all key columns).

-We use `INSERT INTO` to update the data.
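As an aside for readers of this hunk: the request issued by the curl-based Stream Load example can be described in a few lines of Python. This is an illustrative sketch, not part of the patch; the URL, user, and CSV content are the placeholder values from the example, and `describe_request` is a hypothetical helper.

```python
# Illustrative sketch: the pieces of the partial-column Stream Load request.
# Host, credentials, and file content are the placeholder values from the
# curl example in this hunk; describe_request is a hypothetical helper.

STREAM_LOAD_URL = "http://127.0.0.1:8030/api/db1/order_tbl/_stream_load"

# Mirrors the -H flags: partial_columns turns on partial column update, and
# columns must list every key column of the table (here: order_id).
headers = {
    "partial_columns": "true",
    "column_separator": ",",
    "columns": "order_id,order_status",
}

# Body equivalent to /tmp/update.csv: key column plus the column to update.
body = "1,Pending Shipment\n"

def describe_request(url: str, headers: dict) -> str:
    """Render the request as a curl command line for inspection."""
    flags = " ".join(f'-H "{k}:{v}"' for k, v in headers.items())
    return f"curl --location-trusted -u root: {flags} -T update.csv {url}"

print(describe_request(STREAM_LOAD_URL, headers))
```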
+#### Update Result

-```sql
-SET enable_unique_key_partial_update=true;
-INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped');
-```
-
-After the update, the result is as follows:
+The result after the update is as follows:

```sql
+----------+--------------+--------------+
| order_id | order_amount | order_status |
+----------+--------------+--------------+
-| 1 | 100 | To be shipped |
+| 1 | 100 | Pending Shipment |
+----------+--------------+--------------+
1 row in set (0.01 sec)
```

-### Notes
-
-Due to the Merge-on-Write implementation requiring data completion during data writing to ensure optimal query performance, performing partial column updates using the Merge-on-Write implementation may result in a decrease in load performance.
+### Usage Notes

-Suggestions for improving load performance:
+Since the Merge-on-Write implementation needs to complete the entire row of data during writing to ensure optimal query performance, using it for partial column updates may decrease partial load performance.

-- Use SSDs equipped with NVMe or high-speed SSD cloud disks. Reading historical data in large quantities during data completion will generate high read IOPS and read throughput.
+Performance optimization suggestions:

-- Enabling row storage can significantly reduce the IOPS generated during data completion, resulting in noticeable improvements in load performance. Users can enable row storage by using the following property when creating the table:
+- Use SSDs equipped with NVMe or high-speed SSD cloud disks, as completing data will read a large amount of historical data, generating high read IOPS and throughput.
+- Enabling row storage can reduce the IOPS generated when completing data, significantly improving load performance. Enable row storage by setting the following property when creating a table:

```Plain
"store_row_column" = "true"
```

-Now, all rows in a batch write task (whether it is a load task or `INSERT INTO`) can only update the same columns. If you need to update different columns, you will need to perform separate batch writes.
+Currently, all rows in the same batch data writing task (whether a load task or `INSERT INTO`) can only update the same columns. To update data with different columns, write in different batches.

-In the future, flexible column updates will be supported, allowing users to update different columns for each row within the same batch load.
+Future versions will support flexible column updates, allowing users to update different columns for each row in the same batch load.
diff --git a/versioned_docs/version-3.0/data-operate/update/update-of-unique-model.md b/versioned_docs/version-3.0/data-operate/update/update-of-unique-model.md
index 5fdc92bd0c..62124f9a7d 100644
--- a/versioned_docs/version-3.0/data-operate/update/update-of-unique-model.md
+++ b/versioned_docs/version-3.0/data-operate/update/update-of-unique-model.md
@@ -24,129 +24,118 @@ specific language governing permissions and limitations
under the License.
-->

-This document primarily introduces the updates based on the load data on the Doris Unique Key model.
+This document introduces how to update data in the Doris primary key model using various load methods.

-## Full Row Update
+## Whole Row Update

-When loading data into the primary key model (Unique model) in Doris using supported load methods such as Stream Load, Broker Load, Routine Load, Insert Into, etc., if there are no corresponding data rows with the primary key, new data will be inserted. If there are corresponding data rows with the primary key, the data will be updated. In other words, loading data into the Doris primary key model follows an "upsert" mode. Based on the import, updating existing records is by default the [...]
+When loading data into the primary key model (Unique model) using Doris-supported methods like Stream Load, Broker Load, Routine Load, Insert Into, etc., new data is inserted if there is no existing primary key data row. If there is an existing primary key data row, it is updated. This means the load operation in the Doris primary key model works in an "upsert" mode. The process of updating existing records is the same as loading new records by default, so you can refer to the data load [...]

-## Partial Update
+## Partial Column Update

-Updating partial columns mainly refers to directly updating certain field values in a table instead of updating all field values. This can be done using the Update statement, which typically involves reading the entire row data, updating specific field values, and then writing it back. This read-write transaction is time-consuming and not suitable for writing large amounts of data. In the context of load updates on the primary key model, Doris provides functionality to directly insert or [...]
+Partial column update allows you to update specific fields in a table without modifying all fields. You can use the Update statement to perform this operation, which typically involves reading the entire row, updating the desired fields, and writing it back. This read-write transaction is time-consuming and not suitable for large-scale data writing. Doris provides a feature to directly insert or update partial column data in the primary key model load update, bypassing the need to read t [...]

-:::caution Note:
+:::caution Note

-1. Partial updates are only supported in the Merge-on-Write implementation of the Unique Key starting from version 2.0.
-2. Starting from version 2.0.2, partial updates are supported using INSERT INTO.
-3. Partial updates are not supported on tables with materialized views.
-:::
-
-### Use Cases
+1. Version 2.0 only supports partial column updates in the Merge-on-Write implementation of the Unique Key.
+2. Starting from version 2.0.2, partial column updates are supported using INSERT INTO.
+3. Partial column updates are not supported on tables with synchronized materialized views.

-- Real-time dynamic column updates that require high-frequency updates on certain fields in the table. For example, in a user tag table, there are fields containing the latest user behavior information that needs real-time updates to enable real-time analysis and decision-making in advertising/recommendation systems.
+:::

-- Combining multiple source tables into a large denormalized table.
+### Applicable Scenarios
+- Real-time dynamic column updates, requiring frequent updates of specific fields in the table. For example, updating fields related to the latest user behavior in a user tag table for real-time analysis and decision-making in advertising/recommendation systems.
+- Merging multiple source tables into one large wide table.
- Data correction.

-### Usage
+### Usage Example

-#### Table Creation
+Assume there is an order table `order_tbl` in Doris, where the order id is the Key column, and the order status and order amount are the Value columns. The data status is as follows:

-When creating the table, the following property needs to be specified to enable the Merge-on-Write implementation:
+| Order id | Order Amount | Order Status |
+| -------- | -------------| -------------|
+| 1 | 100 | Pending Payment |

```sql
-enable_unique_key_merge_on_write = true
++----------+--------------+--------------+
+| order_id | order_amount | order_status |
++----------+--------------+--------------+
+| 1 | 100 | Pending Payment |
++----------+--------------+--------------+
+1 row in set (0.01 sec)
```

-#### Load
+After the user clicks to pay, the Doris system needs to change the order status of the order with order id '1' to 'Pending Shipment'.
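The upsert-versus-partial-update distinction that this hunk documents can be illustrated with a small simulation. This is not part of the patch; the column names follow the `order_tbl` example, and the merge logic is a deliberately simplified model of what Merge-on-Write does when it completes the row at write time.

```python
# Illustrative model (not part of the patch): whole-row upsert vs.
# partial-column update on a Unique Key table, using plain dicts.
# Column names follow the order_tbl example from this hunk.

table = {1: {"order_amount": 100, "order_status": "Pending Payment"}}

def upsert(table, key, row):
    """Whole-row upsert: the incoming row replaces the stored row entirely."""
    table[key] = dict(row)

def partial_update(table, key, cols):
    """Partial-column update: only the supplied columns change; the others
    are filled in from the stored row (a simplified view of the row
    completion Merge-on-Write performs at write time)."""
    merged = dict(table.get(key, {}))
    merged.update(cols)
    table[key] = merged

partial_update(table, 1, {"order_status": "Pending Shipment"})
print(table[1])  # order_amount survives: {'order_amount': 100, 'order_status': 'Pending Shipment'}
```

The design point the sketch makes explicit: a whole-row upsert with only `order_status` supplied would silently drop `order_amount`, which is why the partial-column path has to read and complete the historical row.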
+
+#### Partial Column Update Using Load Methods

**StreamLoad/BrokerLoad/RoutineLoad**

-If you are using Stream Load/Broker Load/Routine Load, add the following header during the load:
+Prepare the following CSV file:

-```sql
-partial_columns: true
+```
+1,Pending Shipment
```

-Also, specify the columns to be loaded in the `columns` section (all key columns must be included, otherwise updates won't be possible). Below is an example of Stream Load:
+Add the following header during load:

```sql
-$ curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load
+partial_columns:true
```

-**INSERT INTO**
-
-In all data models, when using `INSERT INTO` with a subset of columns, the default behavior is to insert the entire row. To enable partial column updates in the Merge-on-Write implementation, the following session variable needs to be set:
+Specify the columns to be loaded in `columns` (must include all key columns). Below is an example of Stream Load:

```sql
-SET enable_unique_key_partial_update=true
-INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped');
+curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:order_id,order_status" -T /tmp/update.csv http://127.0.0.1:8030/api/db1/order_tbl/_stream_load
```

-Note that the default value for the session variable `enable_insert_strict`, which controls whether the insert statement operates in strict mode, is true. In strict mode, updating non-existing keys during partial column updates is not allowed. So, if you want to insert non-existing keys during partial column updates using the insert statement, you need to set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.
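The `enable_insert_strict` note above can likewise be modeled in miniature. The sketch below is illustrative only: `partial_update` is a hypothetical helper, and Doris itself enforces this check server-side.

```python
# Illustrative model of the enable_insert_strict note: in strict mode a
# partial-column update may only touch keys that already exist; with strict
# mode off, a missing key becomes a new row. partial_update is hypothetical.

def partial_update(table, key, cols, strict=True):
    if key not in table:
        if strict:
            raise KeyError(f"strict mode: key {key} does not exist")
        table[key] = {}  # non-strict: new key; unsupplied columns take defaults
    table[key] = {**table[key], **cols}

table = {1: {"order_amount": 100, "order_status": "Pending Payment"}}

partial_update(table, 1, {"order_status": "Pending Shipment"})      # existing key: ok
try:
    partial_update(table, 2, {"order_status": "Pending Shipment"})  # strict: rejected
except KeyError as exc:
    print(exc)
partial_update(table, 2, {"order_status": "Pending Shipment"}, strict=False)  # inserted
```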
-
-**Flink Connector**
+**INSERT INTO**

-If you are using the Flink Connector, add the following configuration:
+In all data models, the default behavior of `INSERT INTO` when given partial columns is to write the entire row. To prevent misuse, in the Merge-on-Write implementation, `INSERT INTO` maintains the semantics of whole row UPSERT by default. To enable partial column updates, set the following session variable:

```sql
-'sink.properties.partial_columns' = 'true',
+SET enable_unique_key_partial_update=true;
+INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

-Also, specify the columns to be loaded in `sink.properties.column` (all key columns must be included, otherwise updates won't be possible).
+Note that the session variable `enable_insert_strict` defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. To insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and `enable_insert_strict` to false.

-### Example
-
-Suppose there is an order table named `order_tbl` in Doris, where the order ID is a key column, and the order status and order amount are value columns. The data is as follows:
+**Flink Connector**

-| Order ID | Order Amount | Order Status |
-| -------- | ------------ | ------------ |
-| 1 | 100 | Pending |
+If using Flink Connector, add the following configuration:

```sql
-+----------+--------------+--------------+
-| order_id | order_amount | order_status |
-+----------+--------------+--------------+
-| 1 | 100 | Pending |
-+----------+--------------+--------------+
-1 row in set (0.01 sec)
+'sink.properties.partial_columns' = 'true',
```

-Now, when a user clicks on payment, the Doris system needs to update the order status of the order with ID '1' to 'To be shipped'.
+Specify the columns to be loaded in `sink.properties.column` (must include all key columns).
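For completeness, the one-line CSV consumed by the Stream Load example in this hunk can be generated programmatically. This is an illustrative sketch; the row values come from the example, and the column order and separator must match the `columns` and `column_separator` headers.

```python
# Illustrative: build the CSV payload for the partial-column Stream Load.
# Column order must match the `columns` header (order_id,order_status),
# and the delimiter must match `column_separator` (",").
import csv
import io

rows = [(1, "Pending Shipment")]  # key column first, then the updated column

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
payload = buf.getvalue()
print(payload, end="")  # 1,Pending Shipment
```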
-We use `INSERT INTO` to update the data.
+#### Update Result

-```sql
-SET enable_unique_key_partial_update=true;
-INSERT INTO order_tbl (order_id, order_status) VALUES (1,'To be shipped');
-```
-
-After the update, the result is as follows:
+The result after the update is as follows:

```sql
+----------+--------------+--------------+
| order_id | order_amount | order_status |
+----------+--------------+--------------+
-| 1 | 100 | To be shipped |
+| 1 | 100 | Pending Shipment |
+----------+--------------+--------------+
1 row in set (0.01 sec)
```

-### Notes
-
-Due to the Merge-on-Write implementation requiring data completion during data writing to ensure optimal query performance, performing partial column updates using the Merge-on-Write implementation may result in a decrease in load performance.
+### Usage Notes

-Suggestions for improving load performance:
+Since the Merge-on-Write implementation needs to complete the entire row of data during writing to ensure optimal query performance, using it for partial column updates may decrease partial load performance.

-- Use SSDs equipped with NVMe or high-speed SSD cloud disks. Reading historical data in large quantities during data completion will generate high read IOPS and read throughput.
+Performance optimization suggestions:

-- Enabling row storage can significantly reduce the IOPS generated during data completion, resulting in noticeable improvements in load performance. Users can enable row storage by using the following property when creating the table:
+- Use SSDs equipped with NVMe or high-speed SSD cloud disks, as completing data will read a large amount of historical data, generating high read IOPS and throughput.
+- Enabling row storage can reduce the IOPS generated when completing data, significantly improving load performance. Enable row storage by setting the following property when creating a table:

```Plain
"store_row_column" = "true"
```

-Now, all rows in a batch write task (whether it is a load task or `INSERT INTO`) can only update the same columns. If you need to update different columns, you will need to perform separate batch writes.
+Currently, all rows in the same batch data writing task (whether a load task or `INSERT INTO`) can only update the same columns. To update data with different columns, write in different batches.

-In the future, flexible column updates will be supported, allowing users to update different columns for each row within the same batch load.
+Future versions will support flexible column updates, allowing users to update different columns for each row in the same batch load.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org