This is an automated email from the ASF dual-hosted git repository. dataroaring pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 27fc5805fed merge load-atomicity and transaction. (#933) 27fc5805fed is described below commit 27fc5805fedca63500f1c4253d46813740803adb Author: Yongqiang YANG <98214048+dataroar...@users.noreply.github.com> AuthorDate: Thu Aug 1 10:02:52 2024 +0800 merge load-atomicity and transaction. (#933) --- docs/data-operate/import/load-atomicity.md | 233 ---------------- docs/data-operate/import/load-manual.md | 2 + docs/data-operate/transaction.md | 143 +++++++++- .../current/data-operate/import/load-atomicity.md | 295 --------------------- .../current/data-operate/import/load-manual.md | 2 + .../current/data-operate/transaction.md | 145 +++++++++- sidebars.json | 1 - 7 files changed, 276 insertions(+), 545 deletions(-) diff --git a/docs/data-operate/import/load-atomicity.md b/docs/data-operate/import/load-atomicity.md deleted file mode 100644 index 90857dae73c..00000000000 --- a/docs/data-operate/import/load-atomicity.md +++ /dev/null @@ -1,233 +0,0 @@ ---- -{ - "title": "Loading Transaction and Atomicity", - "language": "en" -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. 
---> - -## Use Cases - -All loading tasks in Doris are atomic, which means that a loading job either succeeds completely or fails completely. There won't be a situation where only part of the data is loaded successfully. Additionally, Doris ensures atomicity for loading multiple tables within the same loading task. Moreover, Doris ensures data loading without loss or duplication through the use of labels. For simple loading tasks, no additional configuration or operations are required. For materialized views asso [...] - -1. If a user needs to combine multiple `INSERT INTO` statements into a single transaction for the same target table, they can use the `BEGIN` and `COMMIT` commands. - -2. If a user needs to combine multiple Stream Load jobs into a single transaction, they can use the two-phase commit mode of Stream Load. - -3. If a user needs to load multiple tables atomically, they can rely on the multi-table atomicity of Broker Load. - -## Basic Principles - -In a Doris loading task, the Backend (BE) submits the Tablet IDs of the successful writes to the Frontend (FE). The FE determines the success of the load based on the number of successful tablet replicas. If loading is successful, the transaction is committed and the loaded data becomes visible. If it fails, the transaction is rolled back and the corresponding tablets are cleaned up. - -### Label Mechanism - -Doris loading jobs can be assigned a label. This label is usually a user-defined string with certain business logic attributes. - -The main purpose of the label is to uniquely identify a loading task and ensure that a load with the same label succeeds only once. - -The label mechanism ensures that loaded data is neither lost nor duplicated. When combined with an upstream data source that guarantees At-Least-Once semantics, the label mechanism in Doris can ensure Exactly-Once semantics. - -Labels are unique within a database. The default retention period for labels is 3 days. After 3 days, completed labels are automatically cleaned up and can be reused.
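The Label-based Exactly-Once guarantee described above can be illustrated with a short sketch. This is plain Python modeling only the semantics; the `LabelRegistry` class, its `load` function, and the return strings are invented for illustration and are not Doris internals.

```python
# Illustrative model of the Label mechanism (not actual Doris code):
# a load whose Label has already succeeded is rejected, so retries from
# an At-Least-Once upstream do not duplicate data (Exactly-Once overall).

class LabelRegistry:
    def __init__(self):
        self.succeeded = set()  # labels of committed loads
        self.rows = []          # data visible after commit

    def load(self, label, batch):
        if label in self.succeeded:
            return "Label Already Exists"   # duplicate retry: rejected
        self.rows.extend(batch)             # commit the whole batch atomically
        self.succeeded.add(label)
        return "Success"

registry = LabelRegistry()
batch = [(1, "Emily", 25), (2, "Benjamin", 35)]

# The upstream delivers the same batch twice (At-Least-Once retry):
first = registry.load("my_business1_20220330_125000", batch)
retry = registry.load("my_business1_20220330_125000", batch)
print(first, retry, len(registry.rows))  # Success Label Already Exists 2
```

Note how the second call leaves the row count unchanged: deduplication happens per label, which is why labels must be unique within a database until they are cleaned up.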
- -## Quick Start - -### Insert Into - -**1. Create Table** - -```sql -CREATE TABLE testdb.test_table( - user_id BIGINT NOT NULL COMMENT "User ID", - name VARCHAR(20) NOT NULL COMMENT "User Name", - age INT COMMENT "User Age" -) -DUPLICATE KEY(user_id) -DISTRIBUTED BY HASH(user_id) BUCKETS 10; -``` - -Create a table with the same schema for the failed example: - -```sql -CREATE TABLE testdb.test_table2 LIKE testdb.test_table; -``` - -**2. Successful Load Example** - -```sql -BEGIN; - --- INSERT #1 -INSERT INTO testdb.test_table (user_id, name, age) -VALUES (1, "Emily", 25), - (2, "Benjamin", 35), - (3, "Olivia", 28), - (4, "Alexander", 60), - (5, "Ava", 17); - --- INSERT #2 -INSERT INTO testdb.test_table (user_id, name, age) -VALUES (6, "William", 69), - (7, "Sophia", 32), - (8, "James", 64), - (9, "Emma", 37), - (10, "Liam", 64); - -COMMIT; -``` - -Load result: The loading task starts with the `PREPARE` status and becomes `VISIBLE` only after the COMMIT. - -```JSON -// BEGIN -Query OK, 0 rows affected (0.001 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':''} - -// INSERT #1 -Query OK, 5 rows affected (0.017 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':'10060'} - -// INSERT #2 -Query OK, 5 rows affected (0.007 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':'10060'} - -// COMMIT -Query OK, 0 rows affected (1.013 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'VISIBLE', 'txnId':'10060'} -``` - -Verify the data: - -```sql -MySQL [testdb]> SELECT * FROM testdb.test_table; -+---------+-----------+------+ -| user_id | name | age | -+---------+-----------+------+ -| 5 | Ava | 17 | -| 10 | Liam | 64 | -| 1 | Emily | 25 | -| 4 | Alexander | 60 | -| 7 | Sophia | 32 | -| 9 | Emma | 37 | -| 2 | Benjamin | 35 | -| 3 | Olivia | 28 | -| 6 | William | 69 | -| 8 | James | 64 | -+---------+-----------+------+ -10 rows in set (0.110 sec) -``` - - -### Stream Load - -**1.
Enable two-phase commit by setting `two_phase_commit:true` in the HTTP Header.** - -```Bash -curl --location-trusted -u user:passwd -H "two_phase_commit:true" -T test.txt http://fe_host:http_port/api/{db}/{table}/_stream_load -{ - "TxnId": 18036, - "Label": "55c8ffc9-1c40-4d51-b75e-f2265b3602ef", - "TwoPhaseCommit": "true", - "Status": "Success", - "Message": "OK", - "NumberTotalRows": 100, - "NumberLoadedRows": 100, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 1031, - "LoadTimeMs": 77, - "BeginTxnTimeMs": 1, - "StreamLoadPutTimeMs": 1, - "ReadDataTimeMs": 0, - "WriteDataTimeMs": 58, - "CommitAndPublishTimeMs": 0 -} -``` - -**2. Trigger the commit operation for a transaction (can be sent to FE or BE).** - -- Specify the transaction using the Transaction ID: - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc - { - "status": "Success", - "msg": "transaction [18036] commit successfully." - } - ``` - -- Specify the transaction using the label: - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc - { - "status": "Success", - "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] commit successfully." - } - ``` - -**3. Trigger the abort operation for a transaction (can be sent to FE or BE).** - -- Specify the transaction using the Transaction ID: - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "txn_id:18037" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc - { - "status": "Success", - "msg": "transaction [18037] abort successfully."
- } - ``` - -- Specify the transaction using the label: - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc - { - "status": "Success", - "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] abort successfully." - } - ``` - -### Broker Load - -All Broker Load tasks are atomic and ensure atomicity even when loading multiple tables within the same task. The Label mechanism can be used to ensure that data is loaded without loss or duplication. - -The following example demonstrates loading data from HDFS by using wildcard patterns to match two sets of files and load them into two different tables. - -```sql -LOAD LABEL example_db.label2 -( - DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-10*") - INTO TABLE `my_table1` - PARTITION (p1) - COLUMNS TERMINATED BY "," - (k1, tmp_k2, tmp_k3) - SET ( - k2 = tmp_k2 + 1, - k3 = tmp_k3 + 1 - ) - DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-20*") - INTO TABLE `my_table2` - COLUMNS TERMINATED BY "," - (k1, k2, k3) -) -WITH BROKER hdfs -( - "username"="hdfs_user", - "password"="hdfs_password" -); -``` - -The wildcard pattern is used to match and load two sets of files, `file-10*` and `file-20*`, into `my_table1` and `my_table2` respectively. In the case of `my_table1`, the load is specified to the `p1` partition, and the values of the second and third columns in the source file are incremented by 1 before being loaded. - -## Best Practices - -Labels are typically set in the format of `business_logic+time`, such as `my_business1_20220330_125000`. - -This label is commonly used to represent a batch of data generated by the business logic `my_business1` at `2022-03-30 12:50:00`. By setting this label, the business can query the loading task status using the label to determine whether the data batch at that specific time has been successfully loaded.
If the load fails, the label can be used to retry the load. - -INSERT INTO supports loading the result of a Doris query into another table. INSERT INTO is a synchronous load method that returns the load result after execution. The success or failure of the load can be determined based on the response. INSERT INTO ensures the atomicity of the loading task: either all of the data is loaded successfully, or none of it is. \ No newline at end of file diff --git a/docs/data-operate/import/load-manual.md b/docs/data-operate/import/load-manual.md index edb1cc2b011..ead11af8463 100644 --- a/docs/data-operate/import/load-manual.md +++ b/docs/data-operate/import/load-manual.md @@ -87,6 +87,8 @@ This section mainly introduces some concepts related to import to help users bet All import tasks in Doris are atomic, meaning that an import job either succeeds completely or fails completely. Partially successful data import will not occur within the same import task, and atomicity and consistency between materialized views and base tables are also guaranteed. For simple import tasks, users do not need to perform additional configurations or operations. For materialized views associated with tables, atomicity and consistency with the base table are also guaranteed. +For more details, refer to [Transaction](../../data-operate/transaction.md). + ### Label Mechanism Import jobs in Doris can be assigned a label. This label is usually a user-defined string with certain business logic properties. If not specified by the user, the system will generate one automatically. The main purpose of the label is to uniquely identify an import task and ensure that the same label is imported successfully only once. diff --git a/docs/data-operate/transaction.md b/docs/data-operate/transaction.md index 417fe46f9f7..fae6e70d07d 100644 --- a/docs/data-operate/transaction.md +++ b/docs/data-operate/transaction.md @@ -26,13 +26,15 @@ under the License. A transaction is an operation that contains one or more SQL statements.
The execution of these statements must either be completely successful or completely fail. It is an indivisible work unit. +## Introduction + +A single query or DDL statement is an implicit transaction, and queries and DDL statements are not supported inside multi-statement transactions. Each individual write is an implicit transaction by default, and multiple writes can be combined into an explicit transaction. Doris does not currently support nested transactions. + ## Explicit and Implicit Transactions ### Explicit Transactions -Explicit transactions require users to actively start, commit, or roll back transactions. Doris provides two types of explicit transactions: - -1. The transaction write method introduced in this document : +Explicit transactions require users to actively start, commit, or roll back transactions. DDL and query statements are currently not supported inside explicit transactions. ```sql BEGIN; @@ -40,13 +42,38 @@ Explicit transactions require users to actively start, commit, or roll back tran COMMIT; / ROLLBACK; ``` -2. [Stream Load 2PC](import/load-atomicity.md#stream-load) - ### Implicit Transactions Implicit transactions refer to SQL statements that are executed without explicitly adding statements to start and commit transactions before and after the statements. -In Doris, except for [Group Commit](import/group-commit-manual.md), each import statement opens a transaction when it starts executing. The transaction is automatically committed after the statement is executed, or automatically rolled back if the statement fails. For more information, see [Transaction Load](import/load-atomicity.md). +In Doris, except for [Group Commit](import/import-way/group-commit-manual.md), each import statement opens a transaction when it starts executing. The transaction is automatically committed after the statement is executed, or automatically rolled back if the statement fails. Each query or DDL statement is also an implicit transaction.
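The explicit/implicit behavior above can be illustrated with a toy model. This is plain Python and purely illustrative; the `ToyTable` class and its methods are invented for this sketch and are not Doris code. Writes inside `BEGIN`/`COMMIT` become visible together, `ROLLBACK` discards them all, and a write outside an explicit transaction auto-commits.

```python
# Toy model of transaction visibility (illustrative only, not Doris code).
class ToyTable:
    def __init__(self):
        self.committed = []   # rows visible to queries
        self.pending = None   # uncommitted rows of an open explicit transaction

    def begin(self):
        self.pending = []

    def insert(self, rows):
        if self.pending is not None:      # explicit transaction open
            self.pending.extend(rows)
        else:                             # implicit transaction: auto-commit
            self.committed.extend(rows)

    def commit(self):
        self.committed.extend(self.pending)   # all pending rows appear at once
        self.pending = None

    def rollback(self):
        self.pending = None               # discard everything since BEGIN

t = ToyTable()
t.begin()
t.insert([(1, "Emily")])
t.insert([(2, "Benjamin")])
visible_before_commit = len(t.committed)  # queries still see nothing
t.commit()
visible_after_commit = len(t.committed)   # both inserts appear together

t.begin()
t.insert([(3, "Olivia")])
t.rollback()                              # nothing from this transaction lands
print(visible_before_commit, visible_after_commit, len(t.committed))  # 0 2 2
```

The same model explains atomicity: a failed statement inside the explicit transaction would trigger the `rollback` path, so either both inserts become visible or neither does.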
+ +### Isolation Level + +The only isolation level currently supported by Doris is READ COMMITTED. Under the READ COMMITTED isolation level, a statement sees only data that was committed before the statement began execution. It does not see uncommitted data. + +When a single statement is executed, it captures a snapshot of the tables involved at the start of the statement, meaning that a single statement can only see commits from other transactions made before it began execution. Other transactions' commits are not visible during the execution of a single statement. + +When a statement is executed inside a multi-statement transaction: + +* It sees only data that was committed before the statement began execution. If another transaction commits between the execution of the first and the second statements, two successive statements in the same transaction may see different data. +* Currently, it cannot see changes made by previous statements within the same transaction. + +### No Duplicates, No Loss + +Doris supports mechanisms to ensure no duplicates and no loss during data writes. The Label mechanism ensures no duplicates within a single transaction, while two-phase commit coordinates to prevent duplicates across multiple transactions. + +#### Label Mechanism + +Transactions or writes in Doris can be assigned a Label. This Label is typically a user-defined string with some business logic attributes. If not set, a UUID string will be generated internally. The main purpose of a Label is to uniquely identify a transaction or import task and ensure that a transaction or import with the same Label will only execute successfully once. The Label mechanism ensures that data imports are neither lost nor duplicated. If the upstream data source guarantees [...] + +Doris will clean up Labels based on time and number. By default, if the number of Labels exceeds 2000, cleanup will be triggered. Labels older than three days will also be cleaned up by default. 
Once a Label is cleaned up, a Label with the same name can execute successfully again, meaning it no longer has deduplication semantics. + +Labels are usually set in the format of `business_logic+timestamp`, such as `my_business1_20220330_125000`. This Label typically represents a batch of data generated by the business `my_business1` at `2022-03-30 12:50:00`. By setting Labels this way, the business can query the import task status using the Label to clearly determine whether the batch of data at that time has been successfully imported. If not, the import can be retried using the same Label. + +#### Stream Load 2PC + +[Stream Load 2PC](#stream-load-2pc) is mainly used to support exactly-once semantics (EOS) when writing to Doris with Flink. ## Transaction Operations @@ -78,7 +105,7 @@ Used to roll back all modifications made in the current transaction. Transactions are session-level, so if a session is terminated or closed, the transaction will automatically be rolled back. -## Transaction Load +## Transactions with Multiple SQL Statements Currently, Doris supports two ways of transaction loading. @@ -378,3 +405,105 @@ mysql> SELECT * FROM dt3; * When using JDBC to connect to Doris for transaction operations, please add `useLocalSessionState=true` in the JDBC URL; otherwise, you may encounter the error `This is in a transaction, only insert, update, delete, commit, rollback is acceptable`. * In cloud mode, transaction load does not support `merge on write` unique tables, otherwise, you will encounter the error `Transaction load is not supported for merge on write unique keys table in cloud mode`. + +## Stream Load 2PC + +**1.
Enable two-phase commit by setting `two_phase_commit:true` in the HTTP Header.** + +```Bash +curl --location-trusted -u user:passwd -H "two_phase_commit:true" -T test.txt http://fe_host:http_port/api/{db}/{table}/_stream_load +{ + "TxnId": 18036, + "Label": "55c8ffc9-1c40-4d51-b75e-f2265b3602ef", + "TwoPhaseCommit": "true", + "Status": "Success", + "Message": "OK", + "NumberTotalRows": 100, + "NumberLoadedRows": 100, + "NumberFilteredRows": 0, + "NumberUnselectedRows": 0, + "LoadBytes": 1031, + "LoadTimeMs": 77, + "BeginTxnTimeMs": 1, + "StreamLoadPutTimeMs": 1, + "ReadDataTimeMs": 0, + "WriteDataTimeMs": 58, + "CommitAndPublishTimeMs": 0 +} +``` + +**2. Trigger the commit operation for a transaction (can be sent to FE or BE).** + +- Specify the transaction using the Transaction ID: + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc + { + "status": "Success", + "msg": "transaction [18036] commit successfully." + } + ``` + +- Specify the transaction using the label: + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc + { + "status": "Success", + "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] commit successfully." + } + ``` + +**3. Trigger the abort operation for a transaction (can be sent to FE or BE).** + +- Specify the transaction using the Transaction ID: + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "txn_id:18037" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc + { + "status": "Success", + "msg": "transaction [18037] abort successfully."
+ } + ``` + +- Specify the transaction using the label: + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc + { + "status": "Success", + "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] abort successfully." + } + ``` + +## Broker Load into Multiple Tables in One Transaction + +All Broker Load tasks are atomic and ensure atomicity even when loading multiple tables within the same task. The Label mechanism can be used to ensure that data is loaded without loss or duplication. + +The following example demonstrates loading data from HDFS by using wildcard patterns to match two sets of files and load them into two different tables. + +```sql +LOAD LABEL example_db.label2 +( + DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-10*") + INTO TABLE `my_table1` + PARTITION (p1) + COLUMNS TERMINATED BY "," + (k1, tmp_k2, tmp_k3) + SET ( + k2 = tmp_k2 + 1, + k3 = tmp_k3 + 1 + ) + DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-20*") + INTO TABLE `my_table2` + COLUMNS TERMINATED BY "," + (k1, k2, k3) +) +WITH BROKER hdfs +( + "username"="hdfs_user", + "password"="hdfs_password" +); +``` + +The wildcard pattern is used to match and load two sets of files, `file-10*` and `file-20*`, into `my_table1` and `my_table2` respectively. In the case of `my_table1`, the load is specified to the `p1` partition, and the values of the second and third columns in the source file are incremented by 1 before being loaded.
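For scripting, the two-phase calls shown in the curl examples above can be composed programmatically. The sketch below only constructs the requests and sends nothing over the network; the host, credentials, and helper function names are placeholders invented for this illustration.

```python
# Sketch: build the Stream Load 2PC requests shown above (illustration only;
# fe_host, db/table names, and helper names are placeholders, nothing is sent).

def stream_load_request(fe, db, table, label=None):
    """Phase one: upload data with two-phase commit enabled."""
    headers = {"two_phase_commit": "true"}
    if label:
        headers["label"] = label          # optional user-supplied Label
    return ("PUT", f"http://{fe}/api/{db}/{table}/_stream_load", headers)

def two_pc_request(fe, db, table, operation, txn_id=None, label=None):
    """Phase two: commit or abort, addressed by txn_id or by label."""
    assert operation in ("commit", "abort")
    headers = {"txn_operation": operation}
    if txn_id is not None:
        headers["txn_id"] = str(txn_id)
    else:
        headers["label"] = label
    return ("PUT", f"http://{fe}/api/{db}/{table}/_stream_load_2pc", headers)

# Mirror the documented flow: load, then commit by transaction ID.
m1, u1, h1 = stream_load_request("fe_host:8030", "testdb", "test_table",
                                 label="my_business1_20220330_125000")
method, url, hdrs = two_pc_request("fe_host:8030", "testdb", "test_table",
                                   "commit", txn_id=18036)
print(method, url, hdrs["txn_operation"])
```

In a real client these tuples would be passed to an HTTP library with basic auth and the data payload; the abort path is identical with `operation="abort"`.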
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-atomicity.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-atomicity.md deleted file mode 100644 index 05bc57dd46a..00000000000 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-atomicity.md +++ /dev/null @@ -1,295 +0,0 @@ ---- -{ - "title": "导入事务与原子性", - "language": "zh-CN" -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -## 使用场景 - -Doris 中所有导入任务都是原子性的,即一个导入作业要么全部成功,要么全部失败,不会出现仅部分数据导入成功的情况,并且在同一个导入任务中对多张表的导入也能够保证原子性。同时,Doris 还可以通过 Label 的机制来保证数据导入的不丢不重。对于简单的导入任务,用户无需做额外配置或操作。对于表所附属的物化视图,也同时保证和基表的原子性和一致性。对于以下情形,Doris 为用户提供了更多的事务控制。 - -1. 如果用户需要将对于同一个目标表的多个 `INSERT INTO` 导入组合成一个事务,可以使用 `BEGIN` 和 `COMMIT` 命令。 - -2. 如果用户需要将多个 Stream Load 导入组合成一个事务,可以使用 Stream Load 的两阶段事务提交模式。 - -3. 
Broker Load 多表导入的原子性, - -## 基本原理 - -Doris 导入任务中,BE 会提交写入成功的 Tablet ID 到 FE。FE 会根据 tablet 成功副本数判断导入是否成功,如果成功,该导入的事务被 commit,导入数据可见。如果失败,该导入的事务会被 rollback,相应的 tablet 也会被清理。 - -### Label 机制 - -Doris 的导入作业都可以设置一个 Label。这个 Label 通常是用户自定义的、具有一定业务逻辑属性的字符串。 - -Label 的主要作用是唯一标识一个导入任务,并且能够保证相同的 Label 仅会被成功导入一次。 - -Label 机制可以保证导入数据的不丢不重。如果上游数据源能够保证 At-Least-Once 语义,则配合 Doris 的 Label 机制,能够保证 Exactly-Once 语义。 - -Label 在一个数据库下具有唯一性。Label 的保留期限默认是 3 天。即 3 天后,已完成的 Label 会被自动清理,之后 Label 可以被重复使用。 - -## 快速上手 - -### Insert Into - -**1. 建表** - -```sql -CREATE TABLE testdb.test_table( - user_id BIGINT NOT NULL COMMENT "用户 ID", - name VARCHAR(20) NOT NULL COMMENT "用户姓名", - age INT COMMENT "用户年龄" -) -DUPLICATE KEY(user_id) -DISTRIBUTED BY HASH(user_id) BUCKETS 10; -``` - -创建一个同样 Schema 的表用于失败的例子 - -```sql -CREATE TABLE testdb.test_table2 LIKE testdb.test_table; -``` - -**2. 导入成功的例子** - -```SQL -BEGIN; - --- INSERT #1 -INSERT INTO testdb.test_table (user_id, name, age) -VALUES (1, "Emily", 25), - (2, "Benjamin", 35), - (3, "Olivia", 28), - (4, "Alexander", 60), - (5, "Ava", 17); - --- INSERT #2 -INSERT INTO testdb.test_table (user_id, name, age) -VALUES (6, "William", 69), - (7, "Sophia", 32), - (8, "James", 64), - (9, "Emma", 37), - (10, "Liam", 64); - -COMMIT; -``` - -导入结果,导入任务的状态先是 `PREPARE`,直到 COMMIT 后才是 `VISIBLE`。 - -```json -// BEGIN -Query OK, 0 rows affected (0.001 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':''} - -// INSERT #1 -Query OK, 5 rows affected (0.017 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':'10060'} - -// INSERT #2 -Query OK, 5 rows affected (0.007 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'PREPARE', 'txnId':'10060'} - -// COMMIT -Query OK, 0 rows affected (1.013 sec) -{'label':'txn_insert_2aeac5519bd549a1-a72fe4001c56e10c', 'status':'VISIBLE', 'txnId':'10060'} -``` - -验证数据 - -```sql -MySQL [testdb]> SELECT * FROM testdb.test_table; 
-+---------+-----------+------+ -| user_id | name | age | -+---------+-----------+------+ -| 5 | Ava | 17 | -| 10 | Liam | 64 | -| 1 | Emily | 25 | -| 4 | Alexander | 60 | -| 7 | Sophia | 32 | -| 9 | Emma | 37 | -| 2 | Benjamin | 35 | -| 3 | Olivia | 28 | -| 6 | William | 69 | -| 8 | James | 64 | -+---------+-----------+------+ -10 rows in set (0.110 sec) -``` - -**3. 导入失败的例子** - -```sql -BEGIN; - --- INSERT #1 -INSERT INTO testdb.test_table2 (user_id, name, age) -VALUES (1, "Emily", 25), - (2, "Benjamin", 35), - (3, "Olivia", 28), - (4, "Alexander", 60), - (5, "Ava", 17); - --- INSERT #2 -INSERT INTO testdb.test_table2 (user_id, name, age) -VALUES (6, "William", 69), - (7, "Sophia", 32), - (8, NULL, 64), - (9, "Emma", 37), - (10, "Liam", 64); - -COMMIT; -``` - -导入结果,因为第二个 INSERT INTO 存在 NULL,导致整个事务 COMMIT 失败。 - -```JSON -// BEGIN -Query OK, 0 rows affected (0.001 sec) -{'label':'txn_insert_f3ecb2285edf42e2-92988ee97d74fbb0', 'status':'PREPARE', 'txnId':''} - -// INSERT #1 -Query OK, 5 rows affected (0.012 sec) -{'label':'txn_insert_f3ecb2285edf42e2-92988ee97d74fbb0', 'status':'PREPARE', 'txnId':'10062'} - -// INSERT #2 -{'label':'txn_insert_f3ecb2285edf42e2-92988ee97d74fbb0', 'status':'PREPARE', 'txnId':'10062'} - -// COMMIT -ERROR 1105 (HY000): errCode = 2, detailMessage = errCode = 2, detailMessage = [DATA_QUALITY_ERROR]too many filtered rows -``` - -验证结果,没有数据被导入。 - -```JSON -MySQL [testdb]> SELECT * FROM testdb.test_table2; -Empty set (0.019 sec) -``` - -### Stream Load - -**1. 
在 HTTP Header 中设置 `two_phase_commit:true` 启用两阶段提交。** - -```Bash -curl --location-trusted -u user:passwd -H "two_phase_commit:true" -T test.txt http://fe_host:http_port/api/{db}/{table}/_stream_load -{ - "TxnId": 18036, - "Label": "55c8ffc9-1c40-4d51-b75e-f2265b3602ef", - "TwoPhaseCommit": "true", - "Status": "Success", - "Message": "OK", - "NumberTotalRows": 100, - "NumberLoadedRows": 100, - "NumberFilteredRows": 0, - "NumberUnselectedRows": 0, - "LoadBytes": 1031, - "LoadTimeMs": 77, - "BeginTxnTimeMs": 1, - "StreamLoadPutTimeMs": 1, - "ReadDataTimeMs": 0, - "WriteDataTimeMs": 58, - "CommitAndPublishTimeMs": 0 -} -``` - -**2. 对事务触发 commit 操作(请求发往 FE 或 BE 均可)** - -- 可以使用事务 id 指定事务 - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/stream_load2pc - { - "status": "Success", - "msg": "transaction [18036] commit successfully." - } - ``` - -- 也可以使用 label 指定事务 - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc - { - "status": "Success", - "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] commit successfully." - } - ``` - -**3. 对事务触发 abort 操作(请求发往 FE 或 BE 均可)** - -- 可以使用事务 id 指定事务 - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "txn_id:18037" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/stream_load2pc - { - "status": "Success", - "msg": "transaction [18037] abort successfully." - } - ``` - -- 也可以使用 label 指定事务 - - ```Bash - curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/stream_load2pc - { - "status": "Success", - "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] abort successfully." 
- } - ``` - -### Broker Load - -所有 Broker Load 导入任务都是原子生效的。并且在同一个导入任务中对多张表的导入也能够保证原子性。还可以通过 Label 的机制来保证数据导入的不丢不重。 - -下面例子是从 HDFS 导入数据,使用通配符匹配两批文件,分别导入到两个表中。 - -```sql -LOAD LABEL example_db.label2 -( - DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-10*") - INTO TABLE `my_table1` - PARTITION (p1) - COLUMNS TERMINATED BY "," - (k1, tmp_k2, tmp_k3) - SET ( - k2 = tmp_k2 + 1, - k3 = tmp_k3 + 1 - ) - DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-20*") - INTO TABLE `my_table2` - COLUMNS TERMINATED BY "," - (k1, k2, k3) -) -WITH BROKER hdfs -( - "username"="hdfs_user", - "password"="hdfs_password" -); -``` - -使用通配符匹配导入两批文件 `file-10*` 和 `file-20*`。分别导入到 `my_table1` 和 `my_table2` 两张表中。其中 `my_table1` 指定导入到分区 `p1` 中,并且将导入源文件中第二列和第三列的值 +1 后导入。 - -## 最佳实践 - -Label 通常被设置为 `业务逻辑+时间` 的格式。如 `my_business1_20220330_125000`。 - -这个 Label 通常用于表示:业务 `my_business1` 这个业务在 `2022-03-30 12:50:00` 产生的一批数据。通过这种 Label 设定,业务上可以通过 Label 查询导入任务状态,来明确的获知该时间点批次的数据是否已经导入成功。如果没有成功,则可以使用这个 Label 继续重试导入。 - -INSERT INTO 支持将 Doris 查询的结果导入到另一个表中。INSERT INTO 是一个同步导入方式,执行导入后返回导入结果。可以通过请求的返回判断导入是否成功。INSERT INTO 可以保证导入任务的原子性,要么全部导入成功,要么全部导入失败。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md index 4bd5841c9c4..3d774c545bf 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/import/load-manual.md @@ -86,6 +86,8 @@ Doris支持的导入方式包括Stream Load、Broker Load、Insert Into、Routin Doris 中所有导入任务都是原子性的,即一个导入作业要么全部成功,要么全部失败,不会出现仅部分数据导入成功的情况,并且在同一个导入任务中对多张表的导入也能够保证原子性。对于简单的导入任务,用户无需做额外配置或操作。对于表所附属的物化视图,也同时保证和基表的原子性和一致性。 +更多详细信息参考[事务](../../data-operate/transaction.md)。 + ### 标签机制 Doris 的导入作业都可以设置一个 Label。这个 Label 通常是用户自定义的、具有一定业务逻辑属性的字符串,如果用户不指定,系统也会自动生成一个。Label 的主要作用是唯一标识一个导入任务,并且能够保证相同的 Label 仅会被成功导入一次。 diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/transaction.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/transaction.md index a5f1847de0d..94fa89aa750 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/transaction.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data-operate/transaction.md @@ -26,13 +26,13 @@ under the License. 事务是指一个操作,包含一个或多个SQL语句,这些语句的执行要么完全成功,要么完全失败,是一个不可分割的工作单位。 -## 显式事务和隐式事务 +## 概览 -### 显式事务 +查询和 DDL 单个语句是一个隐式事务,不支持多语句事务中包含查询和 DDL。每个单独的写入默认是一个隐式的事务,多个写入可以组成一个显式事务。目前 Doris 不支持嵌套事务。 -显式事务需要用户主动的开启,提交或回滚事务。 在 Doris 中,提供了 2 种显式事务: +### 显式事务 -1. 本文中介绍的事务写方式,即: +显式事务需要用户主动开启、提交或回滚事务,目前不支持 DDL 和查询语句。 ```sql BEGIN; @@ -40,15 +40,39 @@ under the License. COMMIT; / ROLLBACK; ``` -2. [Stream Load 2PC](import/load-atomicity.md#stream-load) - ### 隐式事务 隐式事务是指用户在所执行的一条或多条SQL语句的前后,没有显式添加开启事务和提交事务的语句。 -在 Doris 中,除[Group Commit](import/group-commit-manual.md)外,每个导入语句在开始执行时都会开启一个事务,并且在该语句执行完成之后,自动提交该事务;或执行失败后,自动回滚该事务。更多详细信息请参考: [导入事务与原子性](import/load-atomicity.md)。 +在 Doris 中,除[Group Commit](import/group-commit-manual.md)外,每个导入语句在开始执行时都会开启一个事务,并且在该语句执行完成之后,自动提交该事务;或执行失败后,自动回滚该事务。每个查询或者 DDL 也是一个隐藏事务。 + +### 隔离级别 +Doris 当前支持的唯一隔离级别是 READ COMMITTED。在 READ COMMITTED 隔离级别下,语句只能看到在该语句开始执行之前已经提交的数据,它不会看到未提交的数据。 + +单个语句执行时,会在语句的开始捕获涉及到表的快照,即单个语句只能看见开始执行前其它事务的提交,单个语句执行期间不可见其它事务的提交。 + +当一个语句在多语句事务中执行时: + +* 只能看到在该语句开始执行之前已经提交的数据。如果在执行第一个和第二个语句之间有另一个事务提交,那么同一事务中的两个连续语句可能会看到不同的数据。 +* 目前看不到在同一事务中之前语句所做的更改。 + +### 不重不丢 + +Doris 有两个机制支持写入的不重不丢,使用 Label 机制提供了单个事务的不重,使用两阶段提交提供了协调多事务不重的能力。 + +#### Label 机制 -## 事务操作 +Doris 的事务或者写入可以设置一个 Label。这个 Label 通常是用户自定义的、具有一定业务逻辑属性的字符串,不设置时内部会生成一个 UUID 字符串。Label 的主要作用是唯一标识一个事务或者导入任务,并且能够保证相同 Label 的事务或者导入仅会成功执行一次。Label 机制可以保证导入数据的不丢不重,如果上游数据源能够保证 At-Least-Once 语义,则配合 Doris 的 Label 机制,能够保证 Exactly-Once 语义。Label 在一个数据库下具有唯一性。 + +Doris 会根据时间和数目清理 Label,默认 Label 数目超过 2000 个就会触发淘汰,默认超过 3 天的 Label 也会被淘汰。Label 被淘汰后相同名称的 Label 可以再次执行成功,即不再具有去重语义。 + +Label 通常被设置为 `业务逻辑+时间` 
的格式。如 `my_business1_20220330_125000`。这个 Label 通常用于表示:业务 `my_business1` 这个业务在 `2022-03-30 12:50:00` 产生的一批数据。通过这种 Label 设定,业务上可以通过 Label 查询导入任务状态,来明确的获知该时间点批次的数据是否已经导入成功。如果没有成功,则可以使用这个 Label 继续重试导入。 + +#### StreamLoad 2PC + +[StreamLoad 2PC](#stream-load),主要用于支持 Flink 写入 Doris 时的 EOS 语义。 + +## 显式事务操作 ### 开启事务 @@ -78,7 +102,7 @@ ROLLBACK; 事务是 Session 级别的,如果 Session 中止或关闭,也会自动回滚该事务。 -## 事务写入 +## 多条 SQL 语句写入 目前 Doris 中支持 2 种方式的事务写入。 @@ -378,3 +402,106 @@ mysql> SELECT * FROM dt3; * 当使用 JDBC 连接 Doris 进行事务操作时,请在 JDBC URL 中添加 `useLocalSessionState=true`,否则可能会遇到错误 `This is in a transaction, only insert, update, delete, commit, rollback is acceptable.` * 存算分离模式下,事务写不支持 Merge-on-Write 表,否则会遇到报错 `Transaction load is not supported for merge on write unique keys table in cloud mode` + + +## Stream Load 2PC + +**1. 在 HTTP Header 中设置 `two_phase_commit:true` 启用两阶段提交。** + +```Bash +curl --location-trusted -u user:passwd -H "two_phase_commit:true" -T test.txt http://fe_host:http_port/api/{db}/{table}/_stream_load +{ + "TxnId": 18036, + "Label": "55c8ffc9-1c40-4d51-b75e-f2265b3602ef", + "TwoPhaseCommit": "true", + "Status": "Success", + "Message": "OK", + "NumberTotalRows": 100, + "NumberLoadedRows": 100, + "NumberFilteredRows": 0, + "NumberUnselectedRows": 0, + "LoadBytes": 1031, + "LoadTimeMs": 77, + "BeginTxnTimeMs": 1, + "StreamLoadPutTimeMs": 1, + "ReadDataTimeMs": 0, + "WriteDataTimeMs": 58, + "CommitAndPublishTimeMs": 0 +} +``` + +**2. 对事务触发 commit 操作(请求发往 FE 或 BE 均可)** + +- 可以使用事务 id 指定事务 + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/stream_load2pc + { + "status": "Success", + "msg": "transaction [18036] commit successfully." 
+ } + ``` + +- 也可以使用 label 指定事务 + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc + { + "status": "Success", + "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] commit successfully." + } + ``` + +**3. 对事务触发 abort 操作(请求发往 FE 或 BE 均可)** + +- 可以使用事务 id 指定事务 + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "txn_id:18037" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/stream_load2pc + { + "status": "Success", + "msg": "transaction [18037] abort successfully." + } + ``` + +- 也可以使用 label 指定事务 + + ```Bash + curl -X PUT --location-trusted -u user:passwd -H "label:55c8ffc9-1c40-4d51-b75e-f2265b3602ef" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/{table}/stream_load2pc + { + "status": "Success", + "msg": "label [55c8ffc9-1c40-4d51-b75e-f2265b3602ef] abort successfully." + } + ``` + +## Broker Load 多表事务 + +所有 Broker Load 导入任务都是原子生效的。并且在同一个导入任务中对多张表的导入也能够保证原子性。还可以通过 Label 的机制来保证数据导入的不丢不重。 + +下面例子是从 HDFS 导入数据,使用通配符匹配两批文件,分别导入到两个表中。 + +```sql +LOAD LABEL example_db.label2 +( + DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-10*") + INTO TABLE `my_table1` + PARTITION (p1) + COLUMNS TERMINATED BY "," + (k1, tmp_k2, tmp_k3) + SET ( + k2 = tmp_k2 + 1, + k3 = tmp_k3 + 1 + ) + DATA INFILE("hdfs://hdfs_host:hdfs_port/input/file-20*") + INTO TABLE `my_table2` + COLUMNS TERMINATED BY "," + (k1, k2, k3) +) +WITH BROKER hdfs +( + "username"="hdfs_user", + "password"="hdfs_password" +); +``` + +使用通配符匹配导入两批文件 `file-10*` 和 `file-20*`。分别导入到 `my_table1` 和 `my_table2` 两张表中。其中 `my_table1` 指定导入到分区 `p1` 中,并且将导入源文件中第二列和第三列的值 +1 后导入。 diff --git a/sidebars.json b/sidebars.json index c5c9a9a3f46..75ab7302e3c 100644 --- a/sidebars.json +++ b/sidebars.json @@ -132,7 +132,6 @@ "data-operate/import/error-data-handling", "data-operate/import/load-data-convert", "data-operate/import/min-load-replica-num", - 
"data-operate/import/load-atomicity", "data-operate/import/migrate-data-from-other-olap" ] }, --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org