morningman opened a new issue #2663: [Proposal] Support partition overwrite 
operation
URL: https://github.com/apache/incubator-doris/issues/2663
 
 
   # Partition Overwrite Operation
   
   Some businesses need data backtracking, such as revising historical data or 
refreshing historical data. In the current version of Doris, there are only two 
ways to do this. One is to use the `REPLACE` aggregation model to overwrite by 
re-loading the data with same keys. The second way is to delete the data and 
then re-load the data.
   
   The first method can guarantee that the data is overwritten atomically, but 
the aggregation type is limited to `REPLACE` and can only overwrite data with 
the same keys. In the second method, there is a gap between deleting the data 
and successfully loading the new data.
   
   Therefore, we need a way that can overwrite data atomically, and without the 
limitations of the aggregation model.
   
   ## Two methods
   
   There are two methods to achieve this.
   
   1. Temporary partition
   
       For partitions that need to be overwritten, users can first create a 
temporary partition and write data to this temporary partition. After the data 
is written, User can replace the original partition with this temporary 
partition by using an atomic replacement command.
   
   2. Overwrite directly
   
       Doris's underlying data is managed by MVCC. Therefore, we can add a 
special version, named `overwrite version` in tablet's meta, to ensure that 
when data is read, all data before that version can be skipped to achieve the 
purpose of overwriting.
   
   Both methods have advantages and disadvantages.
   
   The **disadvantage** of the first method is that the operation is tedious. 
The user needs to perform the three steps of `create temporary partition-> load 
data-> replace partition`. And creating temporary partitions means creating new 
tablets. If there are a large number of overwrite operations, a large number of 
temporary tablets will be generated, which is not friendly to metadata 
management and garbage collection mechanisms. However, the **advantage** of 
this method is that users can load the data in multiple batches, and then 
overwrite and replace them at one time through the partition replacement 
operation.
   
   The **advantage** of the second method is to conform to the user's habits of 
overwriting write operations. Users can set any batch of load as a overwrite 
load. And there is no additional overhead. The **disadvantage** is that the 
user needs to load all the data that needs to be overwritten in a batch of 
load. If the amount of data to be overwritten is large, such as 100 GB or TB 
level, this method will lead to a higher probability of failure in the 
overwrite operation, and the retry cost is high.
   
   Both methods have their own advantages and disadvantages and applicable 
scenarios. So I plan to support both.
   
   ## Detail Design
   
   ### Temporary Partition
   
   The temporary partition method needs to consider the following:
   
   1. How to make the load operation aware of the existence of the temporary 
partition? Or how to distinguish between temporary and formal partitions?
   2. After the temporary partition is created, how to ensure that the formal 
partition will not be modified to ensure that the schema of the two partitions 
is exactly the same when the partition is eventually replaced?
   3. How to perform partition replacement?
   
   First of all, using this method for overwrite operations is usually a 
specific scenario. Therefore, the design principle is to prohibit other 
operations, such as `ALTER TABLE`, to ensure that there are no changes to the 
table schema during the entire overwrite operation to reduce functional 
complexity.
   
   1. Create and delete temporary partitions
   
       `ALTER TABLE tbl ADD TEMPORARY PARTITION (p1, p2, p3);`
   
       Creates a corresponding temporary partitions for the specified 
partitions. The name of the temporary partition is the name of the original 
partition prefixed with `__doris_tmp`.
   
       Tables with temporary partitions cannot be altered. If you delete the 
original partitions, the corresponding temporary partitions will also be 
deleted.
   
       `ALTER TABLE tbl DROP TEMPORARY PARTITION (p1, p2, p3);`
   
       Deletes the temporary partitions.
   
   2. Load data to temporary partition
   
       You must explicitly specify temporary partitions to load data.
   
       * broker load
   
           ```
           LOAD LABEL example_db.label1
           (
           DATA INFILE ("hdfs://hdfs_host:hdfs_port/user/palo/data/input/file")
           INTO TABLE `my_table`
           TEMPORARY PARTITION (p1, p2)
           )
           WITH BROKER my_hdfs_broker
           (
           "username" = "hdfs_user",
           "password" = "hdfs_passwd"
           )
           PROPERTIES
           (
           "timeout" = "3600",
           "max_filter_ratio" = "0.1"
           );
           ```
   
       * stream load
   
           ```
           curl --location-trusted -u root -H "label: l123" -H "temporary 
partitions: p1, p2" ....
           ```
   
       * insert into
   
           ```
           INSERT INTO TABLE TEMPORARY PARTITION (p1, p2) SELECT ...;
           ```
   
   3. Swap temporary partitions
   
       `ALTER TABLE tbl SWAP TEMPORARY PARTITION (p1, p2)`
   
       Swap temporary and formal partitions. After the swapping, the user's 
query will query the exchanged partitions. Abd the temporary partition becomes 
a formal partition, and the formal partition becomes a temporary partition. 
After confirming that the data is correct, user can delete the partition by 
deleting the temporary partition.
   
   ### Overwrite directly
   
   The user can specify a batch of load as an **overwrite load**. The overwrite 
load will proceed as normal load. But in the "publish" phase, the publish task 
will record an `overwrite version` in the corresponding tablet meta. The 
version indicates that the data before this version will be directly 
overwritten.
   
   The publish task can record the `overwrite version` in the incremental 
rowset meta and update the tablet meta in memory. When doing the checkpoint of 
the tablet meta, update the `overwrite version` in the tablet meta.
   
   1. Read operation
   
       The read operation will select the rowsets by specified `read version`. 
When the version of a rowset is less than `overwrite version`, and the `read 
version` is greater than or equal to `overwrite version`, the rowset will be 
skipped. If the `read version` is less than `overwrite version`, the` overwrite 
version` has no effect.
   
       `overwrite version` in tablet meta is actually a list. A set of 
`overwrite version` is recorded. The read operation will find the largest 
`overwrite version` that is smaller than the `read version` as **THE** 
overwrite version for this read.
   
       for example:
   
       ```
       0-10
       11-13
       14-14
       15-15
       16-16
       
       overwrite version: 15
       ```
   
       Assuming the `read version` is 14, then the [0-14] version of data is 
read normally, and `overwrite version(15)` does not work.
   
       Assuming the `read version` is 16, we find the largest overwrite version 
smaller than 16 is 15, and only read the two versions 15, 16.
   
       The `overwrite version` list is updated during compaction.
       
   2. Compaction
   
       Compaction operation first determines the range of versions that need to 
be merged, the **start version** and **end version**. Then look for the largest 
`overwrite version` smaller than the end version. Then during compaction, skip 
the rowset whose version is less than `overwrite version`.
       
       The following examples illustrate some typical cases
       
       * Case 1
   
           ```
           0-10
           11-13
           14-14
           15-15
           16-16
           
           overwrite version: 15
           ```
   
           If we want to merge the [14-16] version, the 14 version will be 
skipped, and the resulting rowset version will still be [14-16], but only 
contains data of the 15 and 16 versions. And the `overwrite version` will be 
changed from 15 to 14. If it is not changed, the `overwrite version` will be 
15, but there is a rowset with the [14-16] version, which is weird.
   
           ```
           0-10
           11-13
           14-16
           
           overwrite version: 14
           ```
   
           After that, the base compaction of versions 0 to 16 is continued, 
and the versions [0-10], [11-13] will be skipped, and eventually the data of 
version [0-16] will be generated, and the overwrite version will become 0. When 
overwrite version becomes 0, we can delete the overwrite version.
   
       * Case 2
           
           ```
           0-10
           11-13
           14-14
           15-15
           16-16
           
           overwrite version: 14, 16
           ```
   
           If we want to merge the [14-16] version, the 14 and 15 versions will 
be skipped, and the resulting rowset version is still [14-16], but only 
contains the data of the 16 version. And overwrite version 14 is removed and 16 
becomes 14.
   
           ```
           0-10
           11-13
           14-16
           
           overwrite version: 14
           ```
   
   ## Schedule
   
   I will implement the temporary partitioning method first, and then develop 
directly overwrite method.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org

Reply via email to