morningman opened a new issue #2663: [Proposal] Support partition overwrite operation URL: https://github.com/apache/incubator-doris/issues/2663 # Partition Overwrite Operation Some businesses need data backtracking, such as revising historical data or refreshing historical data. In the current version of Doris, there are only two ways to do this. One is to use the `REPLACE` aggregation model to overwrite by re-loading the data with same keys. The second way is to delete the data and then re-load the data. The first method can guarantee that the data is overwritten atomically, but the aggregation type is limited to `REPLACE` and can only overwrite data with the same keys. In the second method, there is a gap between deleting the data and successfully loading the new data. Therefore, we need a way that can overwrite data atomically, and without the limitations of the aggregation model. ## Two methods There are two methods to achieve this. 1. Temporary partition For partitions that need to be overwritten, users can first create a temporary partition and write data to this temporary partition. After the data is written, User can replace the original partition with this temporary partition by using an atomic replacement command. 2. Overwrite directly Doris's underlying data is managed by MVCC. Therefore, we can add a special version, named `overwrite version` in tablet's meta, to ensure that when data is read, all data before that version can be skipped to achieve the purpose of overwriting. Both methods have advantages and disadvantages. The **disadvantage** of the first method is that the operation is tedious. The user needs to perform the three steps of `create temporary partition-> load data-> replace partition`. And creating temporary partitions means creating new tablets. If there are a large number of overwrite operations, a large number of temporary tablets will be generated, which is not friendly to metadata management and garbage collection mechanisms. However, the **advantage** of this method is that users can load the data in multiple batches, and then overwrite and replace them at one time through the partition replacement operation. The **advantage** of the second method is to conform to the user's habits of overwriting write operations. Users can set any batch of load as a overwrite load. And there is no additional overhead. The **disadvantage** is that the user needs to load all the data that needs to be overwritten in a batch of load. If the amount of data to be overwritten is large, such as 100 GB or TB level, this method will lead to a higher probability of failure in the overwrite operation, and the retry cost is high. Both methods have their own advantages and disadvantages and applicable scenarios. So I plan to support both. ## Detail Design ### Temporary Partition The temporary partition method needs to consider the following: 1. How to make the load operation aware of the existence of the temporary partition? Or how to distinguish between temporary and formal partitions? 2. After the temporary partition is created, how to ensure that the formal partition will not be modified to ensure that the schema of the two partitions is exactly the same when the partition is eventually replaced? 3. How to perform partition replacement? First of all, using this method for overwrite operations is usually a specific scenario. Therefore, the design principle is to prohibit other operations, such as `ALTER TABLE`, to ensure that there are no changes to the table schema during the entire overwrite operation to reduce functional complexity. 1. Create and delete temporary partitions `ALTER TABLE tbl ADD TEMPORARY PARTITION (p1, p2, p3);` Creates a corresponding temporary partitions for the specified partitions. The name of the temporary partition is the name of the original partition prefixed with `__doris_tmp`. Tables with temporary partitions cannot be altered. If you delete the original partitions, the corresponding temporary partitions will also be deleted. `ALTER TABLE tbl DROP TEMPORARY PARTITION (p1, p2, p3);` Deletes the temporary partitions. 2. Load data to temporary partition You must explicitly specify temporary partitions to load data. * broker load ``` LOAD LABEL example_db.label1 ( DATA INFILE ("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file") INTO TABLE `my_table` TEMPORARY PARTITION (p1, p2) ) WITH BROKER my_hdfs_broker ( "username" = "hdfs_user", "password" = "hdfs_passwd" ) PROPERTIES ( "timeout" = "3600", "max_filter_ratio" = "0.1" ); ``` * stream load ``` curl --location-trusted -u root -H "label: l123" -H "temporary partitions: p1, p2" .... ``` * insert into ``` INSERT INTO TABLE TEMPORARY PARTITION (p1, p2) SELECT ...; ``` 3. Swap temporary partitions `ALTER TABLE tbl SWAP TEMPORARY PARTITION (p1, p2)` Swap temporary and formal partitions. After the swapping, the user's query will query the exchanged partitions. Abd the temporary partition becomes a formal partition, and the formal partition becomes a temporary partition. After confirming that the data is correct, user can delete the partition by deleting the temporary partition. ### Overwrite directly The user can specify a batch of load as an **overwrite load**. The overwrite load will proceed as normal load. But in the "publish" phase, the publish task will record an `overwrite version` in the corresponding tablet meta. The version indicates that the data before this version will be directly overwritten. The publish task can record the `overwrite version` in the incremental rowset meta and update the tablet meta in memory. When doing the checkpoint of the tablet meta, update the `overwrite version` in the tablet meta. 1. Read operation The read operation will select the rowsets by specified `read version`. When the version of a rowset is less than `overwrite version`, and the `read version` is greater than or equal to `overwrite version`, the rowset will be skipped. If the `read version` is less than `overwrite version`, the` overwrite version` has no effect. `overwrite version` in tablet meta is actually a list. A set of `overwrite version` is recorded. The read operation will find the largest `overwrite version` that is smaller than the `read version` as **THE** overwrite version for this read. for example: ``` 0-10 11-13 14-14 15-15 16-16 overwrite version: 15 ``` Assuming the `read version` is 14, then the [0-14] version of data is read normally, and `overwrite version(15)` does not work. Assuming the `read version` is 16, we find the largest overwrite version smaller than 16 is 15, and only read the two versions 15, 16. The `overwrite version` list is updated during compaction. 2. Compaction Compaction operation first determines the range of versions that need to be merged, the **start version** and **end version**. Then look for the largest `overwrite version` smaller than the end version. Then during compaction, skip the rowset whose version is less than `overwrite version`. The following examples illustrate some typical cases * Case 1 ``` 0-10 11-13 14-14 15-15 16-16 overwrite version: 15 ``` If we want to merge the [14-16] version, the 14 version will be skipped, and the resulting rowset version will still be [14-16], but only contains data of the 15 and 16 versions. And the `overwrite version` will be changed from 15 to 14. If it is not changed, the `overwrite version` will be 15, but there is a rowset with the [14-16] version, which is weird. ``` 0-10 11-13 14-16 overwrite version: 14 ``` After that, the base compaction of versions 0 to 16 is continued, and the versions [0-10], [11-13] will be skipped, and eventually the data of version [0-16] will be generated, and the overwrite version will become 0. When overwrite version becomes 0, we can delete the overwrite version. * Case 2 ``` 0-10 11-13 14-14 15-15 16-16 overwrite version: 14, 16 ``` If we want to merge the [14-16] version, the 14 and 15 versions will be skipped, and the resulting rowset version is still [14-16], but only contains the data of the 16 version. And overwrite version 14 is removed and 16 becomes 14. ``` 0-10 11-13 14-16 overwrite version: 14 ``` ## Schedule I will implement the temporary partitioning method first, and then develop directly overwrite method.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org