This is an automated email from the ASF dual-hosted git repository. zhangchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 79713b7e80 [doc](update) address comment of unique-update-concurrent-control (#1760) 79713b7e80 is described below commit 79713b7e8058a393e790ebb2abdb7339d4852c05 Author: zhannngchen <zhangc...@selectdb.com> AuthorDate: Tue Jan 14 14:07:47 2025 +0800 [doc](update) address comment of unique-update-concurrent-control (#1760) ## Versions - [x] dev - [x] 3.0 - [x] 2.1 - [ ] 2.0 ## Languages - [x] Chinese - [x] English ## Docs Checklist - [x] Checked by AI - [x] Test Cases Built --- .../update/unique-update-concurrent-control.md | 224 +++++++++----------- .../update/unique-update-concurrent-control.md | 186 +++++++---------- .../update/unique-update-concurrent-control.md | 184 +++++++---------- .../update/unique-update-concurrent-control.md | 184 +++++++---------- .../update/unique-update-concurrent-control.md | 226 +++++++++------------ .../update/unique-update-concurrent-control.md | 226 +++++++++------------ 6 files changed, 519 insertions(+), 711 deletions(-) diff --git a/docs/data-operate/update/unique-update-concurrent-control.md b/docs/data-operate/update/unique-update-concurrent-control.md index 68561a2022..08f9f836c6 100644 --- a/docs/data-operate/update/unique-update-concurrent-control.md +++ b/docs/data-operate/update/unique-update-concurrent-control.md @@ -26,69 +26,86 @@ under the License. ## Overview -Doris adopts a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Each data load operation is assigned a transaction, which ensures atomicity (i.e., the operation either fully succeeds or completely fails). Upon transaction commit, the system assigns a version number. When using the Unique Key model and loading multiple batches of data with duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher version nu [...] +Doris employs Multi-Version Concurrency Control (MVCC) to handle concurrent updates. Each data write operation is assigned a write transaction, ensuring atomicity (i.e., the write operation either fully succeeds or fully fails). Upon committing the write transaction, the system assigns it a version number. In the Unique Key model, when loading data multiple times, if there are duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher ver [...] -In certain scenarios, users may need to specify a sequence column in the table creation statement to customize the order in which data takes effect. For example, when synchronizing data into Doris using multiple concurrent processes, the data may arrive out of order. This could lead to older data overwriting newer data due to its delayed arrival. To address this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, enabling Doris to det [...] +In some scenarios, users may need to adjust the effective order of data by specifying a sequence column in the table creation statement. For example, when synchronizing data to Doris concurrently through multiple threads, data from different threads may arrive out of order. In this case, old data arriving later may incorrectly overwrite new data. To solve this problem, users can assign a lower sequence value to old data and a higher sequence value to new data, allowing Doris to correctly [...] -Additionally, `UPDATE` statements differ significantly from updates performed via data loads at the implementation level. An `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data back. By default, `UPDATE` statements use table-level locks to provide transaction capabilities with Serializable isolation, meaning multiple `UPDATE` statements must be executed serially. However, users can bypass this restriction by modifying the c [...] +Additionally, the `UPDATE` statement differs significantly from updates implemented through data loads at the underlying mechanism level. The `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data. By default, the `UPDATE` statement provides transaction capabilities with Serializable isolation level through table-level locks, meaning multiple `UPDATE` operations can only be executed serially. Users can also bypass this restric [...] ## UPDATE Concurrency Control -By default, concurrent `UPDATE`s on the same table are not allowed in Doris. +By default, concurrent `UPDATE` operations on the same table are not allowed. -The main reason is that Doris currently supports row-level updates, which means that even if the user specifies to update only a specific column (e.g., `SET v2 = 1`), all other value columns will be overwritten as well (even though their values remain unchanged). +The main reason is that Doris currently supports row updates, which means that even if the user declares `SET v2 = 1`, all other value columns will also be overwritten (even if the values have not changed). -This poses a problem when multiple `UPDATE` operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data. +This can lead to a problem where if two `UPDATE` operations update the same row simultaneously, the behavior may be indeterminate, potentially resulting in dirty data. -However, in practical applications, if the user can ensure that concurrent updates will not affect the same row simultaneously, they can manually enable the concurrent update restriction. This can be done by modifying the FE (Frontend) configuration `enable_concurrent_update`. When this configuration is set to `true`, the update command will not have transaction guarantees. - -:::caution Caution: -Enabling the `enable_concurrent_update` configuration may introduce certain performance risks. -::: +However, in practical applications, if users can ensure that concurrent updates do not operate on the same row simultaneously, they can manually enable concurrent updates. By modifying the FE configuration `enable_concurrent_update`, setting this configuration value to `true` will disable transaction guarantees for update commands. ## Sequence Column -The Unique model primarily caters to scenarios that require unique primary keys, ensuring the uniqueness constraint. When loading data in the same batch or different batches, the replacement order is not guaranteed. The uncertainty in the replacement order results in ambiguity in the specific data loaded into the table. +The Unique model is mainly for scenarios requiring unique primary keys, ensuring the uniqueness constraint of the primary key. The replacement order of data loaded in the same batch or different batches is not guaranteed. Without a guaranteed replacement order, the specific data ultimately loaded into the table is uncertain. -To address this issue, Doris supports sequence columns. Users can specify a sequence column during data load, allowing the replacement order to be controlled by the user. The sequence column determines the order of replacements for rows with the same key column. A higher sequence value can replace a lower one, but not vice versa. This method delegates the determination of order to the user, enabling control over the replacement sequence. +To solve this problem, Doris supports sequence columns. By specifying a sequence column during loading, data with the same key column is replaced based on the sequence column value, with larger values replacing smaller ones, and vice versa. This method allows users to control the replacement order. -:::note -Sequence columns are currently supported only in the Unique model. -::: +In implementation, Doris adds a hidden column **__DORIS_SEQUENCE_COL__**, whose type is specified by the user during table creation. The specific value of this column is determined during data loading, and the effective row for the same key column is decided based on this value. -### Basic Principles - -The basic principle is achieved by adding a hidden column called **__DORIS_SEQUENCE_COL__**. The type of this column is specified by the user during table creation and its specific value is determined during data load. Based on this value, the row that takes effect is determined for rows with the same key column. +:::caution Note +The sequence column currently only supports the Unique model. +::: -**Table Creation** +### Enabling Sequence Column Support -When creating a Unique table, an automatically added hidden column called __DORIS_SEQUENCE_COL__ is created, based on the user-specified type. +When creating a new table, if `function_column.sequence_col` or `function_column.sequence_type` is set, the new table will support sequence columns. -**Data load** +For a table that does not support sequence columns, you can enable this feature using the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")`. -During data load, the FE (Frontend) sets the value of the hidden column as the value of the `ORDER BY` expression (for broker load and routine load) or the value of the `function_column.sequence_col` expression (for stream load). The value column is replaced based on this sequence value. The value of the hidden column, `__DORIS_SEQUENCE_COL__`, can be set as a column in the data source or a column in the table structure. +To check if a table supports sequence columns, you can set a session variable to display hidden columns `SET show_hidden_columns=true`, then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it is supported; otherwise, it is not. -### Syntax Usage +### Usage Example -**Sequence Column has two ways to create a table, one is to set the `sequence_col` attribute when creating a table, and the other is to set the `sequence_type` attribute when creating a table.** +Below is an example of using Stream Load: -**1. Set `sequence_col` (Recommended)** +**1. Create a table supporting sequence columns** -When creating a Unique table, specify the mapping of the sequence column to other columns in the table. +Create a unique model `test_table` and map the sequence column to the `modify_date` column. -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` -`sequence_col` is used to specify the mapping of the sequence column to a column in the table. The column can be of type integer or time type (DATE, DATETIME), and its type cannot be changed after creation. +The `sequence_col` specifies the mapping of the sequence column to a column in the table. This column can be of integer or date/time type (DATE, DATETIME) and cannot be changed after creation. -The load method is the same as when there is no sequence column, making it relatively simple. This method is recommended. +The table structure is as follows: -**2. Set `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -When creating a Unique table, specify the type of the sequence column. +In addition to specifying the sequence column through column mapping, Doris also supports creating a sequence column based on a specified type. This method does not require a column in the schema for mapping. The syntax is as follows: ```Plain PROPERTIES ( @@ -96,13 +113,45 @@ PROPERTIES ( ); ``` -`sequence_type` is used to specify the type of the sequence column, which can be integer or time type (DATE, DATETIME). +The `sequence_type` specifies the type of the sequence column, which can be integer or date/time type (DATE, DATETIME). + +**2. Load Data:** + +Using column mapping (`function_column.sequence_col`) to specify the sequence column does not require modifying any parameters. Below is an example of loading data using Stream Load: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +Stream load command: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +Result: + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +In this load job, the value '2020-03-05' in the sequence column (modify_date) is the largest, so the keyword column retains 'c'. -**During data load, you need to specify the mapping of the sequence column to other columns.** +If the sequence column is specified using `function_column.sequence_col` during table creation, the sequence column mapping must be specified during load. **1. Stream Load** -The syntax for stream load is to add the mapping of the hidden column `function_column.sequence_col` to the `source_sequence` in the header. Example: +In Stream Load, specify the sequence column mapping in the header: ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -Set the mapping of the hidden column `source_sequence` in the `ORDER BY` clause. +Set the hidden column mapping in the `ORDER BY` clause: ```sql LOAD LABEL db1.label1 @@ -159,88 +208,7 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### Enabling Sequence Column Support - -If `function_column.sequence_col` or `function_column.sequence_type` is set when creating a new table, the new table will support sequence columns. - -For a table that does not support sequence columns, if you want to use this feature, you can use the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable it. - -If you are unsure whether a table supports sequence columns, you can set a session variable to display hidden columns with `SET show_hidden_columns=true`, and then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it means that the table supports sequence columns; otherwise, it does not. - -### Usage Example - -Here is an example using Stream Load to demonstrate the usage: - -**1. Create a table with sequence col support** - -Create a unique model `test_table` and specify the sequence column mapping to the `modify_date` column in the table. - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -Table structure: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. Load data normally:** - -Load the following data: - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -Here is an example using Stream Load: - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -The result is: - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -In the data load, because the value of the sequence column (i.e., modify_date) '2020-03-05' is the maximum, the keyword column retains the value 'c'. - -**3. Guarantee the order of replacement** +**3. Ensuring Replacement Order** After completing the above steps, load the following data: @@ -249,7 +217,7 @@ After completing the above steps, load the following data: 1 2020-02-22 1 2020-02-23 b ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -260,16 +228,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -In the loaded data, the sequence column (modify_date) of all previously loaded data is compared, and '2020-03-05' is the maximum. Therefore, the keyword column retains the value 'c'. +In this load, the sequence column value '2020-03-05' is the largest, so the keyword column retains 'c'. -**4. Try loading the following data again** +**4. Try Loading the Following Data** ```Plain 1 2020-02-22 1 2020-02-22 a 1 2020-02-22 1 2020-03-23 w ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -280,16 +248,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -Now the original data in the table can be replaced. In summary, during thestream load process, the sequence column is used to determine the order of replacement for duplicate records. The record with the maximum value in the sequence column will be retained in the table. +This time, the data in the table is replaced. In summary, during the load process, the sequence column values of all batches are compared, and the record with the largest value is loaded into the Doris table. ### Note -1. To prevent misuse, users must explicitly specify the sequence column in loading tasks such as StreamLoad/BrokerLoad and in insert statements for row updates (unless the default value of the sequence column is CURRENT_TIMESTAMP). Otherwise, the following error message will be received: +1. To prevent misuse, in StreamLoad/BrokerLoad load tasks and row update insert statements, users must explicitly specify the sequence column (unless the default value of the sequence column is CURRENT_TIMESTAMP), otherwise, the following error message will be received: ```Plain Table test_tbl has sequence column, need to specify the sequence column ``` -2. Since version 2.0, Doris has supported partial column updates for Merge-on-Write implementation of Unique Key tables. In partial column update, users can update only a subset of columns each time, so it is not necessary to include the sequence column. If the loading task submitted by the user includes the sequence column, it has no effect. If the loading task submitted by the user does not include the sequence column, Doris will use the value of the matching sequence column from the h [...] +2. Since version 2.0, Doris supports partial column update capability for Unique Key tables with Merge-on-Write implementation. In partial column update loads, users can update only a portion of the columns each time, so it is not necessary to include the sequence column. If the load task submitted by the user includes the sequence column, the behavior is unaffected; if the load task does not include the sequence column, Doris will use the sequence column from the matching historical dat [...] -3. In cases of concurrent data load, Doris utilizes MVCC (Multi-Version Concurrency Control) mechanism to ensure data correctness. If two batches of loaded data update different columns of the same key, the load task with a higher system version will reapply the data for the same key written by the load task with a lower version after the lower version load task succeeds. +3. During concurrent loads, Doris uses the MVCC mechanism to ensure data correctness. If two batches of data loads update different columns of the same key, the load task with the higher system version will use the data row written by the lower version load task to fill in the same key after the lower version load task succeeds. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/unique-update-concurrent-control.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/unique-update-concurrent-control.md index f04297dceb..fdf269073f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/unique-update-concurrent-control.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/update/unique-update-concurrent-control.md @@ -28,7 +28,7 @@ under the License. Doris 采用多版本并发控制机制(MVCC - Multi-Version Concurrency Control)来管理并发更新。每次数据写入操作均会分配一个写入事务,该事务确保数据写入的原子性(即写入操作要么完全成功,要么完全失败)。在写入事务提交时,系统会为其分配一个版本号。当用户使用 Unique Key 模型并多次导入数据时,如果存在重复主键,Doris 会根据版本号确定覆盖顺序:版本号较高的数据会覆盖版本号较低的数据。 -在某些场景中,用户可能需要通过在建表语句中指定 sequence 列来灵活调整数据的生效顺序。例如,当通过多线程并发同步数据到 Doris 时,不同线程的数据可能会乱序到达。这种情况下,可能出现旧数据因较晚到达而错误覆盖新数据的情况。为解决这一问题,用户可以为旧数据指定较低的 sequence 值,为新数据指定较高的 sequence 值,从而让 Doris 根据用户提供的 sequence值来正确确定数据的更新顺序。 +在某些场景中,用户可能需要通过在建表语句中指定 sequence 列来灵活调整数据的生效顺序。例如,当通过多线程并发同步数据到 Doris 时,不同线程的数据可能会乱序到达。这种情况下,可能出现旧数据因较晚到达而错误覆盖新数据的情况。为解决这一问题,用户可以为旧数据指定较低的 sequence 值,为新数据指定较高的 sequence 值,从而让 Doris 根据用户提供的 sequence 值来正确确定数据的更新顺序。 此外,`UPDATE` 语句与通过导入实现更新在底层机制上存在较大差异。`UPDATE` 操作涉及两个步骤:从数据库中读取待更新的数据,以及写入更新后的数据。默认情况下,`UPDATE` 语句通过表级锁提供了 Serializable 隔离级别的事务能力,即多个 `UPDATE` 操作只能串行执行。用户也可以通过调整配置绕过这一限制,具体方法请参阅以下章节的详细说明。 @@ -42,53 +42,70 @@ Doris 采用多版本并发控制机制(MVCC - Multi-Version Concurrency Contr 但在实际应用中,如果用户自己可以保证即使并发更新,也不会同时对同一行进行操作的话,就可以手动打开并发限制。通过修改 FE 配置 `enable_concurrent_update`,当该配置值设置为 `true` 时,更新命令将不再提供事务保证。 -:::caution 注意: -开启 `enable_concurrent_update` 配置后,会有一定的性能风险 -::: - ## Sequence 列 Unique 模型主要针对需要唯一主键的场景,可以保证主键唯一性约束,在同一批次中导入或者不同批次中导入的数据,替换顺序不做保证。替换顺序无法保证则无法确定最终导入到表中的具体数据,存在了不确定性。 为了解决这个问题,Doris 支持了 sequence 列,通过用户在导入时指定 sequence 列,相同 key 列下,按照 sequence 列的值进行替换,较大值可以替换较小值,反之则无法替换。该方法将顺序的确定交给了用户,由用户控制替换顺序。 -:::note +在实现层面,Doris 增加了一个隐藏列 **__DORIS_SEQUENCE_COL__** ,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 + +:::caution 注意 sequence 列目前只支持 Unique 模型。 ::: -### 基本原理 - -通过增加一个隐藏列**__DORIS_SEQUENCE_COL__** 实现,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 - -**建表** +### 启用 sequence column 支持 -创建 Unique 表时,用户可以设置表中的某一列作为sequence列。 +在新建表时如果设置了 `function_column.sequence_col` 或者 `function_column.sequence_type` ,则新建表将支持 sequence column。 -**导入** +对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 -导入时,fe 在解析的过程中将隐藏列的值设置成 `order by` 表达式的值 (broker load 和 routine load),或者`function_column.sequence_col`表达式的值 (stream load),value 列将按照该值进行替换。隐藏列`__DORIS_SEQUENCE_COL__`的值既可以设置为数据源中一列,也可以是表结构中的一列。 +如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用 `desc tablename`,如果输出中有 `__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 -### 使用语法 +### 使用示例 -**Sequence 列建表时有两种方式,一种是建表时设置`sequence_col`属性,一种是建表时设置`sequence_type`属性。** +下面以 Stream Load 为例展示使用方式: -**1. 设置 `sequence_col`(推荐)** +**1. 创建支持 sequence col 的表** -创建 Unique 表时,指定 sequence 列到表中其他 column 的映射 +创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` sequence_col 用来指定 sequence 列到表中某一列的映射,该列可以为整型和时间类型(DATE、DATETIME),创建后不能更改该列的类型。 -导入方式和没有 sequence 列时一样,使用相对比较简单,推荐使用。 +创建好的表结构如下: -**2. 设置 `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -创建 Uniq 表时,指定 sequence 列类型 +除了上述按照列映射的方式来指定 sequence 之外,Doris 还支持根据指定类型创建 sequence 列的语法,这种方式不要求建表时 schema 中必须有一列来做映射,下面是对应的语法: ```Plain PROPERTIES ( @@ -98,11 +115,43 @@ PROPERTIES ( sequence_type 用来指定 sequence 列的类型,可以为整型和时间类型(DATE、DATETIME)。 -**导入时需要指定 sequence 列到其他列的映射。** +**2. 导入数据:** + +使用列映射的方式 (`function_column.sequence_col`) 来指定 sequence 列,不需要修改任何参数。下面我们用 Stream Load 导入如下数据: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +stream load 命令: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +结果为 + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 + +如果建表时使用了 `function_column.sequence_col` 方式来指定 sequence 列,在导入时需要指定 sequence 列到其他列的映射。 **1. Stream Load** -stream load 的写法是在 header 中的`function_column.sequence_col`字段添加隐藏列对应的 source_sequence 的映射,示例 +stream load 的写法是在 header 中的 `function_column.sequence_col` 字段添加隐藏列对应的 source_sequence 的映射,示例 ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -在`ORDER BY` 处设置隐藏列映射的 source_sequence 字段 +在 `ORDER BY` 处设置隐藏列映射的 source_sequence 字段 ```sql LOAD LABEL db1.label1 @@ -159,87 +208,6 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### 启用 sequence column 支持 - -在新建表时如果设置了`function_column.sequence_col`或者`function_column.sequence_type` ,则新建表将支持 sequence column。 - -对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 - -如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 - -### 使用示例 - -下面以 Stream Load 为例为示例来展示使用方式: - -**1. 创建支持 sequence col 的表** - -创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -表结构如下: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. 正常导入数据:** - -导入如下数据 - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -此处以 stream load 为例 - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -结果为 - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 - **3. 替换顺序的保证** 上述步骤完成后,接着导入如下数据 @@ -260,7 +228,7 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 +在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 **4. 再尝试导入如下数据** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/unique-update-concurrent-control.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/unique-update-concurrent-control.md index f46d317070..fdf269073f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/unique-update-concurrent-control.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/data-operate/update/unique-update-concurrent-control.md @@ -42,53 +42,70 @@ Doris 采用多版本并发控制机制(MVCC - Multi-Version Concurrency Contr 但在实际应用中,如果用户自己可以保证即使并发更新,也不会同时对同一行进行操作的话,就可以手动打开并发限制。通过修改 FE 配置 `enable_concurrent_update`,当该配置值设置为 `true` 时,更新命令将不再提供事务保证。 -:::caution 注意: -开启 `enable_concurrent_update` 配置后,会有一定的性能风险 -::: - ## Sequence 列 Unique 模型主要针对需要唯一主键的场景,可以保证主键唯一性约束,在同一批次中导入或者不同批次中导入的数据,替换顺序不做保证。替换顺序无法保证则无法确定最终导入到表中的具体数据,存在了不确定性。 为了解决这个问题,Doris 支持了 sequence 列,通过用户在导入时指定 sequence 列,相同 key 列下,按照 sequence 列的值进行替换,较大值可以替换较小值,反之则无法替换。该方法将顺序的确定交给了用户,由用户控制替换顺序。 -:::note +在实现层面,Doris 增加了一个隐藏列 **__DORIS_SEQUENCE_COL__** ,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 + +:::caution 注意 sequence 列目前只支持 Unique 模型。 ::: -### 基本原理 - -通过增加一个隐藏列**__DORIS_SEQUENCE_COL__**实现,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 - -**建表** +### 启用 sequence column 支持 -创建 Unique 表时,将按照用户指定类型自动添加一个隐藏列__DORIS_SEQUENCE_COL__。 +在新建表时如果设置了 `function_column.sequence_col` 或者 `function_column.sequence_type` ,则新建表将支持 sequence column。 -**导入** +对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 -导入时,fe 在解析的过程中将隐藏列的值设置成 `order by` 表达式的值 (broker load 和 routine load),或者`function_column.sequence_col`表达式的值 (stream load),value 列将按照该值进行替换。隐藏列`__DORIS_SEQUENCE_COL__`的值既可以设置为数据源中一列,也可以是表结构中的一列。 +如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用 `desc tablename`,如果输出中有 `__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 -### 使用语法 +### 使用示例 -**Sequence 列建表时有两种方式,一种是建表时设置`sequence_col`属性,一种是建表时设置`sequence_type`属性。** +下面以 Stream Load 为例展示使用方式: -**1. 设置 `sequence_col`(推荐)** +**1. 创建支持 sequence col 的表** -创建 Unique 表时,指定 sequence 列到表中其他 column 的映射 +创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` sequence_col 用来指定 sequence 列到表中某一列的映射,该列可以为整型和时间类型(DATE、DATETIME),创建后不能更改该列的类型。 -导入方式和没有 sequence 列时一样,使用相对比较简单,推荐使用。 +创建好的表结构如下: -**2. 设置 `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -创建 Unique 表时,指定 sequence 列类型 +除了上述按照列映射的方式来指定 sequence 之外,Doris 还支持根据指定类型创建 sequence 列的语法,这种方式不要求建表时 schema 中必须有一列来做映射,下面是对应的语法: ```Plain PROPERTIES ( @@ -98,11 +115,43 @@ PROPERTIES ( sequence_type 用来指定 sequence 列的类型,可以为整型和时间类型(DATE、DATETIME)。 -**导入时需要指定 sequence 列到其他列的映射。** +**2. 导入数据:** + +使用列映射的方式 (`function_column.sequence_col`) 来指定 sequence 列,不需要修改任何参数。下面我们用 Stream Load 导入如下数据: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +stream load 命令: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +结果为 + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 + +如果建表时使用了 `function_column.sequence_col` 方式来指定 sequence 列,在导入时需要指定 sequence 列到其他列的映射。 **1. Stream Load** -stream load 的写法是在 header 中的`function_column.sequence_col`字段添加隐藏列对应的 source_sequence 的映射,示例 +stream load 的写法是在 header 中的 `function_column.sequence_col` 字段添加隐藏列对应的 source_sequence 的映射,示例 ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -在`ORDER BY` 处设置隐藏列映射的 source_sequence 字段 +在 `ORDER BY` 处设置隐藏列映射的 source_sequence 字段 ```sql LOAD LABEL db1.label1 @@ -159,87 +208,6 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### 启用 sequence column 支持 - -在新建表时如果设置了`function_column.sequence_col`或者`function_column.sequence_type` ,则新建表将支持 sequence column。 - -对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 - -如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 - -### 使用示例 - -下面以 Stream Load 为例为示例来展示使用方式: - -**1. 创建支持 sequence col 的表** - -创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -表结构如下: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. 正常导入数据:** - -导入如下数据 - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -此处以 stream load 为例 - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -结果为 - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 - **3. 替换顺序的保证** 上述步骤完成后,接着导入如下数据 @@ -260,7 +228,7 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 +在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 **4. 再尝试导入如下数据** diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/unique-update-concurrent-control.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/unique-update-concurrent-control.md index f46d317070..fdf269073f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/unique-update-concurrent-control.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/data-operate/update/unique-update-concurrent-control.md @@ -42,53 +42,70 @@ Doris 采用多版本并发控制机制(MVCC - Multi-Version Concurrency Contr 但在实际应用中,如果用户自己可以保证即使并发更新,也不会同时对同一行进行操作的话,就可以手动打开并发限制。通过修改 FE 配置 `enable_concurrent_update`,当该配置值设置为 `true` 时,更新命令将不再提供事务保证。 -:::caution 注意: -开启 `enable_concurrent_update` 配置后,会有一定的性能风险 -::: - ## Sequence 列 Unique 模型主要针对需要唯一主键的场景,可以保证主键唯一性约束,在同一批次中导入或者不同批次中导入的数据,替换顺序不做保证。替换顺序无法保证则无法确定最终导入到表中的具体数据,存在了不确定性。 为了解决这个问题,Doris 支持了 sequence 列,通过用户在导入时指定 sequence 列,相同 key 列下,按照 sequence 列的值进行替换,较大值可以替换较小值,反之则无法替换。该方法将顺序的确定交给了用户,由用户控制替换顺序。 -:::note +在实现层面,Doris 增加了一个隐藏列 **__DORIS_SEQUENCE_COL__** ,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 + +:::caution 注意 sequence 列目前只支持 Unique 模型。 ::: -### 基本原理 - -通过增加一个隐藏列**__DORIS_SEQUENCE_COL__**实现,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值决定相同 Key 列下,哪一行生效。 - -**建表** +### 启用 sequence column 支持 -创建 Unique 表时,将按照用户指定类型自动添加一个隐藏列__DORIS_SEQUENCE_COL__。 +在新建表时如果设置了 `function_column.sequence_col` 或者 `function_column.sequence_type` ,则新建表将支持 sequence column。 -**导入** +对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 -导入时,fe 在解析的过程中将隐藏列的值设置成 `order by` 表达式的值 (broker load 和 routine load),或者`function_column.sequence_col`表达式的值 (stream load),value 列将按照该值进行替换。隐藏列`__DORIS_SEQUENCE_COL__`的值既可以设置为数据源中一列,也可以是表结构中的一列。 +如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用 `desc tablename`,如果输出中有 `__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 -### 使用语法 +### 使用示例 -**Sequence 列建表时有两种方式,一种是建表时设置`sequence_col`属性,一种是建表时设置`sequence_type`属性。** +下面以 Stream Load 为例展示使用方式: -**1. 设置 `sequence_col`(推荐)** +**1. 创建支持 sequence col 的表** -创建 Unique 表时,指定 sequence 列到表中其他 column 的映射 +创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` sequence_col 用来指定 sequence 列到表中某一列的映射,该列可以为整型和时间类型(DATE、DATETIME),创建后不能更改该列的类型。 -导入方式和没有 sequence 列时一样,使用相对比较简单,推荐使用。 +创建好的表结构如下: -**2. 设置 `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -创建 Unique 表时,指定 sequence 列类型 +除了上述按照列映射的方式来指定 sequence 之外,Doris 还支持根据指定类型创建 sequence 列的语法,这种方式不要求建表时 schema 中必须有一列来做映射,下面是对应的语法: ```Plain PROPERTIES ( @@ -98,11 +115,43 @@ PROPERTIES ( sequence_type 用来指定 sequence 列的类型,可以为整型和时间类型(DATE、DATETIME)。 -**导入时需要指定 sequence 列到其他列的映射。** +**2. 导入数据:** + +使用列映射的方式 (`function_column.sequence_col`) 来指定 sequence 列,不需要修改任何参数。下面我们用 Stream Load 导入如下数据: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +stream load 命令: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +结果为 + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 + +如果建表时使用了 `function_column.sequence_col` 方式来指定 sequence 列,在导入时需要指定 sequence 列到其他列的映射。 **1. Stream Load** -stream load 的写法是在 header 中的`function_column.sequence_col`字段添加隐藏列对应的 source_sequence 的映射,示例 +stream load 的写法是在 header 中的 `function_column.sequence_col` 字段添加隐藏列对应的 source_sequence 的映射,示例 ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -在`ORDER BY` 处设置隐藏列映射的 source_sequence 字段 +在 `ORDER BY` 处设置隐藏列映射的 source_sequence 字段 ```sql LOAD LABEL db1.label1 @@ -159,87 +208,6 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### 启用 sequence column 支持 - -在新建表时如果设置了`function_column.sequence_col`或者`function_column.sequence_type` ,则新建表将支持 sequence column。 - -对于一个不支持 sequence column 的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 - -如果不确定一个表是否支持 sequence column,可以通过设置一个 session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 - -### 使用示例 - -下面以 Stream Load 为例为示例来展示使用方式: - -**1. 创建支持 sequence col 的表** - -创建 unique 模型的 test_table 数据表,并指定 sequence 列映射到表中的 modify_date 列。 - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -表结构如下: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. 正常导入数据:** - -导入如下数据 - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -此处以 stream load 为例 - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -结果为 - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -在这次导入中,因 sequence column 的值(也就是 modify_date 中的值)中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 - **3. 替换顺序的保证** 上述步骤完成后,接着导入如下数据 @@ -260,7 +228,7 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中'2020-03-05'为最大值,所以 keyword 列中最终保留了 c。 +在这次导入的数据中,会比较所有已导入数据的 sequence column (也就是 modify_date),其中 '2020-03-05' 为最大值,所以 keyword 列中最终保留了 c。 **4. 再尝试导入如下数据** diff --git a/versioned_docs/version-2.1/data-operate/update/unique-update-concurrent-control.md b/versioned_docs/version-2.1/data-operate/update/unique-update-concurrent-control.md index 943d7e08bd..08f9f836c6 100644 --- a/versioned_docs/version-2.1/data-operate/update/unique-update-concurrent-control.md +++ b/versioned_docs/version-2.1/data-operate/update/unique-update-concurrent-control.md @@ -26,69 +26,86 @@ under the License. ## Overview -Doris adopts a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Each data load operation is assigned a transaction, which ensures atomicity (i.e., the operation either fully succeeds or completely fails). Upon transaction commit, the system assigns a version number. When using the Unique Key model and loading multiple batches of data with duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher version nu [...] +Doris employs Multi-Version Concurrency Control (MVCC) to handle concurrent updates. Each data write operation is assigned a write transaction, ensuring atomicity (i.e., the write operation either fully succeeds or fully fails). Upon committing the write transaction, the system assigns it a version number. In the Unique Key model, when loading data multiple times, if there are duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher ver [...] -In certain scenarios, users may need to specify a sequence column in the table creation statement to customize the order in which data takes effect. For example, when synchronizing data into Doris using multiple concurrent processes, the data may arrive out of order. This could lead to older data overwriting newer data due to its delayed arrival. To address this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, enabling Doris to det [...] +In some scenarios, users may need to adjust the effective order of data by specifying a sequence column in the table creation statement. For example, when synchronizing data to Doris concurrently through multiple threads, data from different threads may arrive out of order. In this case, old data arriving later may incorrectly overwrite new data. To solve this problem, users can assign a lower sequence value to old data and a higher sequence value to new data, allowing Doris to correctly [...] -Additionally, `UPDATE` statements differ significantly from updates performed via data loads at the implementation level. An `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data back. By default, `UPDATE` statements use table-level locks to provide transaction capabilities with Serializable isolation, meaning multiple `UPDATE` statements must be executed serially. However, users can bypass this restriction by modifying the c [...] +Additionally, the `UPDATE` statement differs significantly from updates implemented through data loads at the underlying mechanism level. The `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data. By default, the `UPDATE` statement provides transaction capabilities with Serializable isolation level through table-level locks, meaning multiple `UPDATE` operations can only be executed serially. Users can also bypass this restric [...] -## Update Concurrency Control +## UPDATE Concurrency Control -By default, concurrent `UPDATE`s on the same table are not allowed in Doris. +By default, concurrent `UPDATE` operations on the same table are not allowed. -The main reason is that Doris currently supports row-level updates, which means that even if the user specifies to update only a specific column (e.g., `SET v2 = 1`), all other value columns will be overwritten as well (even though their values remain unchanged). +The main reason is that Doris currently supports row updates, which means that even if the user declares `SET v2 = 1`, all other value columns will also be overwritten (even if the values have not changed). -This poses a problem when multiple `UPDATE` operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data. +This can lead to a problem where if two `UPDATE` operations update the same row simultaneously, the behavior may be indeterminate, potentially resulting in dirty data. -However, in practical applications, if the user can ensure that concurrent updates will not affect the same row simultaneously, they can manually enable the concurrent update restriction. This can be done by modifying the FE (Frontend) configuration `enable_concurrent_update`. When this configuration is set to `true`, the update command will not have transaction guarantees. - -:::caution Caution: -Enabling the `enable_concurrent_update` configuration may introduce certain performance risks. -::: +However, in practical applications, if users can ensure that concurrent updates do not operate on the same row simultaneously, they can manually enable concurrent updates. By modifying the FE configuration `enable_concurrent_update`, setting this configuration value to `true` will disable transaction guarantees for update commands. ## Sequence Column -The Unique model primarily caters to scenarios that require unique primary keys, ensuring the uniqueness constraint. When loading data in the same batch or different batches, the replacement order is not guaranteed. The uncertainty in the replacement order results in ambiguity in the specific data loaded into the table. +The Unique model is mainly for scenarios requiring unique primary keys, ensuring the uniqueness constraint of the primary key. The replacement order of data loaded in the same batch or different batches is not guaranteed. Without a guaranteed replacement order, the specific data ultimately loaded into the table is uncertain. -To address this issue, Doris supports sequence columns. Users can specify a sequence column during data load, allowing the replacement order to be controlled by the user. The sequence column determines the order of replacements for rows with the same key column. A higher sequence value can replace a lower one, but not vice versa. This method delegates the determination of order to the user, enabling control over the replacement sequence. +To solve this problem, Doris supports sequence columns. By specifying a sequence column during loading, data with the same key column is replaced based on the sequence column value, with larger values replacing smaller ones, and vice versa. This method allows users to control the replacement order. -:::note -Sequence columns are currently supported only in the Unique model. -::: +In implementation, Doris adds a hidden column **__DORIS_SEQUENCE_COL__**, whose type is specified by the user during table creation. The specific value of this column is determined during data loading, and the effective row for the same key column is decided based on this value. -### Basic Principles - -The basic principle is achieved by adding a hidden column called **__DORIS_SEQUENCE_COL__**. The type of this column is specified by the user during table creation and its specific value is determined during data load. Based on this value, the row that takes effect is determined for rows with the same key column. +:::caution Note +The sequence column currently only supports the Unique model. +::: -**Table Creation** +### Enabling Sequence Column Support -When creating a Unique table, an automatically added hidden column called __DORIS_SEQUENCE_COL__ is created, based on the user-specified type. +When creating a new table, if `function_column.sequence_col` or `function_column.sequence_type` is set, the new table will support sequence columns. -**Data load** +For a table that does not support sequence columns, you can enable this feature using the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")`. -During data load, the FE (Frontend) sets the value of the hidden column as the value of the `ORDER BY` expression (for broker load and routine load) or the value of the `function_column.sequence_col` expression (for stream load). The value column is replaced based on this sequence value. The value of the hidden column, `__DORIS_SEQUENCE_COL__`, can be set as a column in the data source or a column in the table structure. +To check if a table supports sequence columns, you can set a session variable to display hidden columns `SET show_hidden_columns=true`, then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it is supported; otherwise, it is not. -### Syntax Usage +### Usage Example -**Sequence Column has two ways to create a table, one is to set the `sequence_col` attribute when creating a table, and the other is to set the `sequence_type` attribute when creating a table.** +Below is an example of using Stream Load: -**1. Set `sequence_col` (Recommended)** +**1. Create a table supporting sequence columns** -When creating a Unique table, specify the mapping of the sequence column to other columns in the table. +Create a unique model `test_table` and map the sequence column to the `modify_date` column. -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` -`sequence_col` is used to specify the mapping of the sequence column to a column in the table. The column can be of type integer or time type (DATE, DATETIME), and its type cannot be changed after creation. +The `sequence_col` specifies the mapping of the sequence column to a column in the table. This column can be of integer or date/time type (DATE, DATETIME) and cannot be changed after creation. -The load method is the same as when there is no sequence column, making it relatively simple. This method is recommended. +The table structure is as follows: -**2. Set `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -When creating a Unique table, specify the type of the sequence column. +In addition to specifying the sequence column through column mapping, Doris also supports creating a sequence column based on a specified type. This method does not require a column in the schema for mapping. The syntax is as follows: ```Plain PROPERTIES ( @@ -96,13 +113,45 @@ PROPERTIES ( ); ``` -`sequence_type` is used to specify the type of the sequence column, which can be integer or time type (DATE, DATETIME). +The `sequence_type` specifies the type of the sequence column, which can be integer or date/time type (DATE, DATETIME). + +**2. Load Data:** + +Using column mapping (`function_column.sequence_col`) to specify the sequence column does not require modifying any parameters. Below is an example of loading data using Stream Load: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +Stream load command: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +Result: + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +In this load job, the value '2020-03-05' in the sequence column (modify_date) is the largest, so the keyword column retains 'c'. -**During data load, you need to specify the mapping of the sequence column to other columns.** +If the sequence column is specified using `function_column.sequence_col` during table creation, the sequence column mapping must be specified during load. **1. Stream Load** -The syntax for stream load is to add the mapping of the hidden column `function_column.sequence_col` to the `source_sequence` in the header. Example: +In Stream Load, specify the sequence column mapping in the header: ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -Set the mapping of the hidden column `source_sequence` in the `ORDER BY` clause. +Set the hidden column mapping in the `ORDER BY` clause: ```sql LOAD LABEL db1.label1 @@ -159,88 +208,7 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### Enabling Sequence Column Support - -If `function_column.sequence_col` or `function_column.sequence_type` is set when creating a new table, the new table will support sequence columns. - -For a table that does not support sequence columns, if you want to use this feature, you can use the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable it. - -If you are unsure whether a table supports sequence columns, you can set a session variable to display hidden columns with `SET show_hidden_columns=true`, and then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it means that the table supports sequence columns; otherwise, it does not. - -### Usage Example - -Here is an example using Stream Load to demonstrate the usage: - -**1. Create a table with sequence col support** - -Create a unique model `test_table` and specify the sequence column mapping to the `modify_date` column in the table. - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -Table structure: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. Load data normally:** - -Load the following data: - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -Here is an example using Stream Load: - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -The result is: - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -In the data load, because the value of the sequence column (i.e., modify_date) '2020-03-05' is the maximum, the keyword column retains the value 'c'. - -**3. Guarantee the order of replacement** +**3. Ensuring Replacement Order** After completing the above steps, load the following data: @@ -249,7 +217,7 @@ After completing the above steps, load the following data: 1 2020-02-22 1 2020-02-23 b ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -260,16 +228,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -In the loaded data, the sequence column (modify_date) of all previously loaded data is compared, and '2020-03-05' is the maximum. Therefore, the keyword column retains the value 'c'. +In this load, the sequence column value '2020-03-05' is the largest, so the keyword column retains 'c'. -**4. Try loading the following data again** +**4. Try Loading the Following Data** ```Plain 1 2020-02-22 1 2020-02-22 a 1 2020-02-22 1 2020-03-23 w ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -280,16 +248,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -Now the original data in the table can be replaced. In summary, during thestream load process, the sequence column is used to determine the order of replacement for duplicate records. The record with the maximum value in the sequence column will be retained in the table. +This time, the data in the table is replaced. In summary, during the load process, the sequence column values of all batches are compared, and the record with the largest value is loaded into the Doris table. ### Note -1. To prevent misuse, users must explicitly specify the sequence column in loading tasks such as StreamLoad/BrokerLoad and in insert statements for row updates (unless the default value of the sequence column is CURRENT_TIMESTAMP). Otherwise, the following error message will be received: +1. To prevent misuse, in StreamLoad/BrokerLoad load tasks and row update insert statements, users must explicitly specify the sequence column (unless the default value of the sequence column is CURRENT_TIMESTAMP), otherwise, the following error message will be received: ```Plain Table test_tbl has sequence column, need to specify the sequence column ``` -2. Since version 2.0, Doris has supported partial column updates for Merge-on-Write implementation of Unique Key tables. In partial column update, users can update only a subset of columns each time, so it is not necessary to include the sequence column. If the loading task submitted by the user includes the sequence column, it has no effect. If the loading task submitted by the user does not include the sequence column, Doris will use the value of the matching sequence column from the h [...] +2. Since version 2.0, Doris supports partial column update capability for Unique Key tables with Merge-on-Write implementation. In partial column update loads, users can update only a portion of the columns each time, so it is not necessary to include the sequence column. If the load task submitted by the user includes the sequence column, the behavior is unaffected; if the load task does not include the sequence column, Doris will use the sequence column from the matching historical dat [...] -3. In cases of concurrent data load, Doris utilizes MVCC (Multi-Version Concurrency Control) mechanism to ensure data correctness. If two batches of loaded data update different columns of the same key, the load task with a higher system version will reapply the data for the same key written by the load task with a lower version after the lower version load task succeeds. +3. During concurrent loads, Doris uses the MVCC mechanism to ensure data correctness. If two batches of data loads update different columns of the same key, the load task with the higher system version will use the data row written by the lower version load task to fill in the same key after the lower version load task succeeds. diff --git a/versioned_docs/version-3.0/data-operate/update/unique-update-concurrent-control.md b/versioned_docs/version-3.0/data-operate/update/unique-update-concurrent-control.md index 943d7e08bd..08f9f836c6 100644 --- a/versioned_docs/version-3.0/data-operate/update/unique-update-concurrent-control.md +++ b/versioned_docs/version-3.0/data-operate/update/unique-update-concurrent-control.md @@ -26,69 +26,86 @@ under the License. ## Overview -Doris adopts a Multi-Version Concurrency Control (MVCC) mechanism to manage concurrent updates. Each data load operation is assigned a transaction, which ensures atomicity (i.e., the operation either fully succeeds or completely fails). Upon transaction commit, the system assigns a version number. When using the Unique Key model and loading multiple batches of data with duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher version nu [...] +Doris employs Multi-Version Concurrency Control (MVCC) to handle concurrent updates. Each data write operation is assigned a write transaction, ensuring atomicity (i.e., the write operation either fully succeeds or fully fails). Upon committing the write transaction, the system assigns it a version number. In the Unique Key model, when loading data multiple times, if there are duplicate primary keys, Doris determines the overwrite order based on the version number: data with a higher ver [...] -In certain scenarios, users may need to specify a sequence column in the table creation statement to customize the order in which data takes effect. For example, when synchronizing data into Doris using multiple concurrent processes, the data may arrive out of order. This could lead to older data overwriting newer data due to its delayed arrival. To address this, users can assign a lower sequence value to the older data and a higher sequence value to the newer data, enabling Doris to det [...] +In some scenarios, users may need to adjust the effective order of data by specifying a sequence column in the table creation statement. For example, when synchronizing data to Doris concurrently through multiple threads, data from different threads may arrive out of order. In this case, old data arriving later may incorrectly overwrite new data. To solve this problem, users can assign a lower sequence value to old data and a higher sequence value to new data, allowing Doris to correctly [...] -Additionally, `UPDATE` statements differ significantly from updates performed via data loads at the implementation level. An `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data back. By default, `UPDATE` statements use table-level locks to provide transaction capabilities with Serializable isolation, meaning multiple `UPDATE` statements must be executed serially. However, users can bypass this restriction by modifying the c [...] +Additionally, the `UPDATE` statement differs significantly from updates implemented through data loads at the underlying mechanism level. The `UPDATE` operation involves two steps: reading the data to be updated from the database and writing the updated data. By default, the `UPDATE` statement provides transaction capabilities with Serializable isolation level through table-level locks, meaning multiple `UPDATE` operations can only be executed serially. Users can also bypass this restric [...] -## Update Concurrency Control +## UPDATE Concurrency Control -By default, concurrent `UPDATE`s on the same table are not allowed in Doris. +By default, concurrent `UPDATE` operations on the same table are not allowed. -The main reason is that Doris currently supports row-level updates, which means that even if the user specifies to update only a specific column (e.g., `SET v2 = 1`), all other value columns will be overwritten as well (even though their values remain unchanged). +The main reason is that Doris currently supports row updates, which means that even if the user declares `SET v2 = 1`, all other value columns will also be overwritten (even if the values have not changed). -This poses a problem when multiple `UPDATE` operations are performed concurrently on the same row. The behavior becomes unpredictable, and it may lead to inconsistent or "dirty" data. +This can lead to a problem where if two `UPDATE` operations update the same row simultaneously, the behavior may be indeterminate, potentially resulting in dirty data. -However, in practical applications, if the user can ensure that concurrent updates will not affect the same row simultaneously, they can manually enable the concurrent update restriction. This can be done by modifying the FE (Frontend) configuration `enable_concurrent_update`. When this configuration is set to `true`, the update command will not have transaction guarantees. - -:::caution Caution: -Enabling the `enable_concurrent_update` configuration may introduce certain performance risks. -::: +However, in practical applications, if users can ensure that concurrent updates do not operate on the same row simultaneously, they can manually enable concurrent updates. By modifying the FE configuration `enable_concurrent_update`, setting this configuration value to `true` will disable transaction guarantees for update commands. ## Sequence Column -The Unique model primarily caters to scenarios that require unique primary keys, ensuring the uniqueness constraint. When loading data in the same batch or different batches, the replacement order is not guaranteed. The uncertainty in the replacement order results in ambiguity in the specific data loaded into the table. +The Unique model is mainly for scenarios requiring unique primary keys, ensuring the uniqueness constraint of the primary key. The replacement order of data loaded in the same batch or different batches is not guaranteed. Without a guaranteed replacement order, the specific data ultimately loaded into the table is uncertain. -To address this issue, Doris supports sequence columns. Users can specify a sequence column during data load, allowing the replacement order to be controlled by the user. The sequence column determines the order of replacements for rows with the same key column. A higher sequence value can replace a lower one, but not vice versa. This method delegates the determination of order to the user, enabling control over the replacement sequence. +To solve this problem, Doris supports sequence columns. By specifying a sequence column during loading, data with the same key column is replaced based on the sequence column value, with larger values replacing smaller ones, and vice versa. This method allows users to control the replacement order. -:::note -Sequence columns are currently supported only in the Unique model. -::: +In implementation, Doris adds a hidden column **__DORIS_SEQUENCE_COL__**, whose type is specified by the user during table creation. The specific value of this column is determined during data loading, and the effective row for the same key column is decided based on this value. -### Basic Principles - -The basic principle is achieved by adding a hidden column called **__DORIS_SEQUENCE_COL__**. The type of this column is specified by the user during table creation and its specific value is determined during data load. Based on this value, the row that takes effect is determined for rows with the same key column. +:::caution Note +The sequence column currently only supports the Unique model. +::: -**Table Creation** +### Enabling Sequence Column Support -When creating a Unique table, an automatically added hidden column called __DORIS_SEQUENCE_COL__ is created, based on the user-specified type. +When creating a new table, if `function_column.sequence_col` or `function_column.sequence_type` is set, the new table will support sequence columns. -**Data load** +For a table that does not support sequence columns, you can enable this feature using the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")`. -During data load, the FE (Frontend) sets the value of the hidden column as the value of the `ORDER BY` expression (for broker load and routine load) or the value of the `function_column.sequence_col` expression (for stream load). The value column is replaced based on this sequence value. The value of the hidden column, `__DORIS_SEQUENCE_COL__`, can be set as a column in the data source or a column in the table structure. +To check if a table supports sequence columns, you can set a session variable to display hidden columns `SET show_hidden_columns=true`, then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it is supported; otherwise, it is not. -### Syntax Usage +### Usage Example -**Sequence Column has two ways to create a table, one is to set the `sequence_col` attribute when creating a table, and the other is to set the `sequence_type` attribute when creating a table.** +Below is an example of using Stream Load: -**1. Set `sequence_col` (Recommended)** +**1. Create a table supporting sequence columns** -When creating a Unique table, specify the mapping of the sequence column to other columns in the table. +Create a unique model `test_table` and map the sequence column to the `modify_date` column. -```Plain -PROPERTIES ( - "function_column.sequence_col" = 'column_name', +```sql +CREATE TABLE test.test_table +( + user_id bigint, + date date, + group_id bigint, + modify_date date, + keyword VARCHAR(128) +) +UNIQUE KEY(user_id, date, group_id) +DISTRIBUTED BY HASH (user_id) BUCKETS 32 +PROPERTIES( + "function_column.sequence_col" = 'modify_date', + "replication_num" = "1", + "in_memory" = "false" ); ``` -`sequence_col` is used to specify the mapping of the sequence column to a column in the table. The column can be of type integer or time type (DATE, DATETIME), and its type cannot be changed after creation. +The `sequence_col` specifies the mapping of the sequence column to a column in the table. This column can be of integer or date/time type (DATE, DATETIME) and cannot be changed after creation. -The load method is the same as when there is no sequence column, making it relatively simple. This method is recommended. +The table structure is as follows: -**2. Set `sequence_type`** +```sql +MySQL> desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` -When creating a Unique table, specify the type of the sequence column. +In addition to specifying the sequence column through column mapping, Doris also supports creating a sequence column based on a specified type. This method does not require a column in the schema for mapping. The syntax is as follows: ```Plain PROPERTIES ( @@ -96,13 +113,45 @@ PROPERTIES ( ); ``` -`sequence_type` is used to specify the type of the sequence column, which can be integer or time type (DATE, DATETIME). +The `sequence_type` specifies the type of the sequence column, which can be integer or date/time type (DATE, DATETIME). + +**2. Load Data:** + +Using column mapping (`function_column.sequence_col`) to specify the sequence column does not require modifying any parameters. Below is an example of loading data using Stream Load: + +```Plain +1 2020-02-22 1 2020-02-21 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-23 e +1 2020-02-22 1 2020-02-24 b +``` + +Stream load command: + +```shell +curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load +``` + +Result: + +```sql +MySQL> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +In this load job, the value '2020-03-05' in the sequence column (modify_date) is the largest, so the keyword column retains 'c'. -**During data load, you need to specify the mapping of the sequence column to other columns.** +If the sequence column is specified using `function_column.sequence_col` during table creation, the sequence column mapping must be specified during load. **1. Stream Load** -The syntax for stream load is to add the mapping of the hidden column `function_column.sequence_col` to the `source_sequence` in the header. Example: +In Stream Load, specify the sequence column mapping in the header: ```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load @@ -110,7 +159,7 @@ curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "fu **2. Broker Load** -Set the mapping of the hidden column `source_sequence` in the `ORDER BY` clause. +Set the hidden column mapping in the `ORDER BY` clause: ```sql LOAD LABEL db1.label1 @@ -159,88 +208,7 @@ CREATE ROUTINE LOAD example_db.test1 ON example_tbl ); ``` -### Enabling Sequence Column Support - -If `function_column.sequence_col` or `function_column.sequence_type` is set when creating a new table, the new table will support sequence columns. - -For a table that does not support sequence columns, if you want to use this feature, you can use the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable it. - -If you are unsure whether a table supports sequence columns, you can set a session variable to display hidden columns with `SET show_hidden_columns=true`, and then use `desc tablename`. If the output includes the `__DORIS_SEQUENCE_COL__` column, it means that the table supports sequence columns; otherwise, it does not. - -### Usage Example - -Here is an example using Stream Load to demonstrate the usage: - -**1. Create a table with sequence col support** - -Create a unique model `test_table` and specify the sequence column mapping to the `modify_date` column in the table. - -```sql -CREATE TABLE test.test_table -( - user_id bigint, - date date, - group_id bigint, - modify_date date, - keyword VARCHAR(128) -) -UNIQUE KEY(user_id, date, group_id) -DISTRIBUTED BY HASH (user_id) BUCKETS 32 -PROPERTIES( - "function_column.sequence_col" = 'modify_date', - "replication_num" = "1", - "in_memory" = "false" -); -``` - -Table structure: - -```sql -MySQL> desc test_table; -+-------------+--------------+------+-------+---------+---------+ -| Field | Type | Null | Key | Default | Extra | -+-------------+--------------+------+-------+---------+---------+ -| user_id | BIGINT | No | true | NULL | | -| date | DATE | No | true | NULL | | -| group_id | BIGINT | No | true | NULL | | -| modify_date | DATE | No | false | NULL | REPLACE | -| keyword | VARCHAR(128) | No | false | NULL | REPLACE | -+-------------+--------------+------+-------+---------+---------+ -``` - -**2. Load data normally:** - -Load the following data: - -```Plain -1 2020-02-22 1 2020-02-21 a -1 2020-02-22 1 2020-02-22 b -1 2020-02-22 1 2020-03-05 c -1 2020-02-22 1 2020-02-26 d -1 2020-02-22 1 2020-02-23 e -1 2020-02-22 1 2020-02-24 b -``` - -Here is an example using Stream Load: - -```shell -curl --location-trusted -u root: -T testData http://host:port/api/test/test_table/_stream_load -``` - -The result is: - -```sql -MySQL> select * from test_table; -+---------+------------+----------+-------------+---------+ -| user_id | date | group_id | modify_date | keyword | -+---------+------------+----------+-------------+---------+ -| 1 | 2020-02-22 | 1 | 2020-03-05 | c | -+---------+------------+----------+-------------+---------+ -``` - -In the data load, because the value of the sequence column (i.e., modify_date) '2020-03-05' is the maximum, the keyword column retains the value 'c'. - -**3. Guarantee the order of replacement** +**3. Ensuring Replacement Order** After completing the above steps, load the following data: @@ -249,7 +217,7 @@ After completing the above steps, load the following data: 1 2020-02-22 1 2020-02-23 b ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -260,16 +228,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -In the loaded data, the sequence column (modify_date) of all previously loaded data is compared, and '2020-03-05' is the maximum. Therefore, the keyword column retains the value 'c'. +In this load, the sequence column value '2020-03-05' is the largest, so the keyword column retains 'c'. -**4. Try loading the following data again** +**4. Try Loading the Following Data** ```Plain 1 2020-02-22 1 2020-02-22 a 1 2020-02-22 1 2020-03-23 w ``` -Query the data: +Query data: ```sql MySQL [test]> select * from test_table; @@ -280,16 +248,16 @@ MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ ``` -Now the original data in the table can be replaced. In summary, during thestream load process, the sequence column is used to determine the order of replacement for duplicate records. The record with the maximum value in the sequence column will be retained in the table. +This time, the data in the table is replaced. In summary, during the load process, the sequence column values of all batches are compared, and the record with the largest value is loaded into the Doris table. ### Note -1. To prevent misuse, users must explicitly specify the sequence column in loading tasks such as StreamLoad/BrokerLoad and in insert statements for row updates (unless the default value of the sequence column is CURRENT_TIMESTAMP). Otherwise, the following error message will be received: +1. To prevent misuse, in StreamLoad/BrokerLoad load tasks and row update insert statements, users must explicitly specify the sequence column (unless the default value of the sequence column is CURRENT_TIMESTAMP), otherwise, the following error message will be received: ```Plain Table test_tbl has sequence column, need to specify the sequence column ``` -2. Since version 2.0, Doris has supported partial column updates for Merge-on-Write implementation of Unique Key tables. In partial column update, users can update only a subset of columns each time, so it is not necessary to include the sequence column. If the loading task submitted by the user includes the sequence column, it has no effect. If the loading task submitted by the user does not include the sequence column, Doris will use the value of the matching sequence column from the h [...] +2. Since version 2.0, Doris supports partial column update capability for Unique Key tables with Merge-on-Write implementation. In partial column update loads, users can update only a portion of the columns each time, so it is not necessary to include the sequence column. If the load task submitted by the user includes the sequence column, the behavior is unaffected; if the load task does not include the sequence column, Doris will use the sequence column from the matching historical dat [...] -3. In cases of concurrent data load, Doris utilizes MVCC (Multi-Version Concurrency Control) mechanism to ensure data correctness. If two batches of loaded data update different columns of the same key, the load task with a higher system version will reapply the data for the same key written by the load task with a lower version after the lower version load task succeeds. +3. During concurrent loads, Doris uses the MVCC mechanism to ensure data correctness. If two batches of data loads update different columns of the same key, the load task with the higher system version will use the data row written by the lower version load task to fill in the same key after the lower version load task succeeds. --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org